While two-stage detectors like Faster R-CNN achieve high accuracy by first proposing regions of interest and then classifying them, this sequential process can be computationally intensive. Single-stage detectors offer an alternative approach, prioritizing speed by performing localization and classification in a single forward pass of the network. The most influential family in this category is YOLO, or "You Only Look Once."
YOLO frames object detection fundamentally differently from region proposal-based methods. Instead of identifying potential object regions first, YOLO treats detection as a regression problem, directly predicting bounding boxes and class probabilities from the entire image in one evaluation.
The core idea involves dividing the input image into an $S \times S$ grid. Each grid cell is responsible for detecting any object whose center falls within that cell. For each grid cell, the network predicts:
Bounding Boxes: A fixed number ($B$) of bounding boxes. Each bounding box prediction consists of 5 values: the box center coordinates $(x, y)$ relative to the grid cell, the box width and height $(w, h)$ relative to the full image, and a confidence score.
Class Probabilities: Conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$, for each of the $C$ classes. This probability is conditioned on an object being present in the grid cell.
This entire process happens within a single convolutional network. The network takes the image as input and outputs a 3D tensor of shape $S \times S \times (B \times 5 + C)$, which encodes the predictions for all grid cells.
Figure: The YOLO model divides the input image into a grid. A single CNN processes the image and, for each grid cell, predicts multiple bounding boxes and class probabilities, encoded in an output tensor.
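A small NumPy sketch makes this output layout concrete. The numbers below are the original paper's defaults ($S=7$, $B=2$, $C=20$ for PASCAL VOC), and the tensor is a random stand-in for a real network's output:

```python
import numpy as np

# YOLOv1 defaults: 7x7 grid, 2 boxes per cell, 20 PASCAL VOC classes.
S, B, C = 7, 2, 20

# Stand-in for the network output; a trained model produces this tensor.
output = np.random.rand(S, S, B * 5 + C)
print(output.shape)  # (7, 7, 30)

# For one grid cell, split the prediction vector into its components.
cell = output[3, 4]                    # cell at grid row 3, column 4
boxes = cell[:B * 5].reshape(B, 5)     # each row: (x, y, w, h, confidence)
class_probs = cell[B * 5:]             # C conditional class probabilities
print(boxes.shape, class_probs.shape)  # (2, 5) (20,)
```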
The confidence score for each predicted box ($\text{box}_j$) within a grid cell is formally defined as:

$$\text{Confidence}_j = \Pr(\text{Object}) \times \text{IOU}(\text{pred}_j, \text{truth})$$

Here, $\Pr(\text{Object})$ is the probability that an object center is present in the grid cell associated with the prediction, and $\text{IOU}(\text{pred}_j, \text{truth})$ is the Intersection over Union between the predicted box ($\text{pred}_j$) and the ground truth box. If no object exists in that cell, $\Pr(\text{Object})$ should be zero, making the confidence score zero.
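To make this definition concrete, here is a minimal Python sketch of the IoU computation and the resulting confidence target. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, an illustrative convention rather than anything fixed by the original formulation:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(object_present, pred_box, truth_box):
    """Confidence per the definition above: Pr(Object) * IoU, zero if no object."""
    return iou(pred_box, truth_box) if object_present else 0.0
```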
During inference, the final class-specific confidence score for each bounding box is calculated by multiplying the box confidence score with the conditional class probability:
$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}(\text{pred}_j, \text{truth}) = \Pr(\text{Class}_i) \times \text{IOU}(\text{pred}_j, \text{truth})$$

This score tells us both the probability that a specific class is present in the box and how well the predicted box fits the object. Boxes below a certain threshold are discarded, and Non-Maximum Suppression (NMS) is applied to remove redundant overlapping boxes for the same object.
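The scoring-and-filtering step can be sketched in a few lines of Python. The greedy NMS below is a common textbook variant; the function names are illustrative, and it reuses the `iou` helper from the previous sketch:

```python
import numpy as np

def class_specific_scores(box_confidences, class_probs):
    """Pr(Class_i|Object) * Pr(Object) * IoU = Pr(Class_i) * IoU, per box and class.

    box_confidences: shape (num_boxes,), class_probs: shape (num_boxes, C).
    """
    return class_probs * box_confidences[:, None]

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]            # indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps < iou_threshold]  # discard redundant boxes
    return keep
```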
The original YOLO (YOLOv1) was remarkably fast but had limitations, particularly in detecting small objects close together (as each grid cell predicted only a limited number of boxes and only one set of class probabilities) and achieving precise localization compared to two-stage methods. Subsequent versions introduced significant improvements:
YOLOv2 (YOLO9000): Introduced anchor boxes, similar to those used in Faster R-CNN and SSD. Instead of directly predicting bounding box dimensions, the network predicts offsets relative to pre-defined anchor box shapes. This made it easier for the network to learn to predict common object shapes and improved recall. YOLOv2 also used higher resolution input, incorporated Batch Normalization, and employed a new backbone network (Darknet-19). It was notably trained jointly on detection (COCO) and classification (ImageNet) datasets, enabling it to detect object categories it hadn't seen labeled bounding boxes for.
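The YOLOv2 paper constrains each prediction relative to its grid cell and anchor: the center offsets pass through a sigmoid so the center stays inside the cell, while the width and height scale the anchor exponentially. A minimal Python sketch of that decoding (the function name and argument layout are illustrative):

```python
import math

def decode_anchor_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw offsets (tx, ty, tw, th) against a grid cell and anchor.

    (cx, cy) is the cell's top-left corner in grid units;
    (pw, ph) is the anchor (prior) box width and height.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx        # center offset bounded to the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)       # scale the anchor; always positive
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```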
YOLOv3: Further refined the approach by introducing multi-scale predictions. It makes predictions at three different spatial resolutions (downsampled by 32, 16, and 8) using feature maps from different stages of the backbone network (now Darknet-53). This significantly improved the detection of small objects. YOLOv3 also switched from using a softmax for class prediction within a box to using independent logistic classifiers for each class, allowing for multi-label predictions (e.g., an object being both "Person" and "Woman").
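The practical difference between softmax and independent logistic classifiers is easy to see numerically. The logits below are made-up values chosen to illustrate overlapping labels:

```python
import numpy as np

logits = np.array([2.1, 1.8, -3.0])   # e.g. scores for "Person", "Woman", "Car"

# Softmax forces the classes to compete: probabilities sum to 1,
# so "Person" and "Woman" cannot both receive high probability.
softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax)      # ~[0.57, 0.42, 0.003]

# Independent sigmoids score each class on its own,
# allowing multi-label outputs such as "Person" and "Woman" together.
independent = 1.0 / (1.0 + np.exp(-logits))
print(independent)  # ~[0.89, 0.86, 0.05]
```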
YOLOv4, YOLOv5, and Beyond: Development continued rapidly, focusing on optimizing the balance between speed and accuracy. These versions often incorporated a collection of architectural and training enhancements, such as stronger backbones (for example, CSPDarknet53), feature-aggregation necks (for example, PANet), heavier data augmentation (for example, mosaic augmentation), and refined loss functions and training schedules.
It's important to note that "YOLOv5" and subsequent versions (YOLOv6, v7, v8, YOLO-NAS, and others) represent a lineage often tied to specific codebases (such as Ultralytics) rather than to single research papers that definitively define each version, leading to a fragmented but rapidly evolving landscape. The fundamental principle of single-stage, grid-based regression remains consistent.
Strengths:
Speed: Localization and classification happen in a single forward pass, making YOLO fast enough for real-time applications.
Global context: The network sees the entire image at once, so predictions incorporate image-wide context rather than only local region features.
Simplicity: A single network trained end to end replaces the multi-stage propose-then-classify pipeline.
Weaknesses:
Localization accuracy: Bounding boxes tend to be somewhat less precise than those of the best two-stage detectors.
Small, crowded objects: The grid structure limits how many nearby objects each cell can represent, so small objects appearing close together remain challenging.
Numerous pre-trained YOLO models are available across various deep learning frameworks (PyTorch, TensorFlow/Keras). These models, trained on large datasets like COCO, provide excellent starting points. Fine-tuning a pre-trained YOLO model on a custom dataset is a common practice for adapting it to specific detection tasks. Libraries like Ultralytics YOLO or frameworks like MMDetection offer streamlined tools for training and deploying YOLO-based detectors.
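As a concrete starting point, here is a minimal fine-tuning sketch using the Ultralytics package; the dataset YAML, image path, and hyperparameters are placeholders to replace with your own:

```python
# pip install ultralytics
from ultralytics import YOLO

# Start from COCO-pretrained weights.
model = YOLO("yolov8n.pt")

# Fine-tune on a custom dataset described by a YAML file
# (class names plus paths to train/val images).
model.train(data="my_dataset.yaml", epochs=50, imgsz=640)

# Run inference; results contain boxes, class labels, and scores.
results = model("example.jpg")
```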
In summary, the YOLO family represents a significant branch in the evolution of object detection, trading a small amount of localization accuracy (compared to the best two-stage models) for substantial gains in processing speed. Its direct regression approach and continuous improvements have made it a workhorse for real-time computer vision applications.