While two-stage detectors like Faster R-CNN achieve high accuracy by first proposing regions and then classifying them, this sequential process can introduce computational overhead. Single-stage detectors streamline this process by performing localization and classification simultaneously, directly predicting bounding boxes and class probabilities from feature maps in a single pass through the network. This architectural choice generally leads to faster inference times, making them suitable for real-time applications. We'll examine two influential single-stage detectors: the Single Shot MultiBox Detector (SSD) and RetinaNet.
SSD tackles the challenge of detecting objects at various scales by making predictions from multiple feature maps at different resolutions within a single forward pass of a base network (like VGG or ResNet). Earlier feature maps (closer to the input) have higher spatial resolution and capture finer details, making them suitable for detecting smaller objects. Later feature maps have lower resolution but larger receptive fields, enabling them to detect larger objects.
Architecture and Multi-Scale Feature Maps: SSD starts with a standard classification network (the backbone), truncated before the final classification layers. It then appends several auxiliary convolutional layers that progressively decrease in spatial resolution while increasing the channel depth. Unlike models that only predict from the final feature layer, SSD generates predictions from selected feature maps at various stages of this backbone and auxiliary structure.
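A minimal PyTorch sketch of this progressive down-sampling may help make the structure concrete. The channel sizes here are illustrative assumptions, not the exact configuration from the SSD paper:

```python
import torch
import torch.nn as nn

# Sketch of SSD-style auxiliary feature layers. Each block halves the
# spatial resolution; predictions are drawn from every returned map.
class AuxiliaryLayers(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c_in, c_mid, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(c_mid, c_out, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
            for c_in, c_mid, c_out in [(512, 256, 512), (512, 128, 256), (256, 128, 256)]
        ])

    def forward(self, x):
        feature_maps = [x]           # last retained backbone feature map
        for block in self.blocks:
            x = block(x)
            feature_maps.append(x)   # progressively coarser maps
        return feature_maps

feats = AuxiliaryLayers()(torch.randn(1, 512, 38, 38))
print([tuple(f.shape[-2:]) for f in feats])  # [(38, 38), (19, 19), (10, 10), (5, 5)]
```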
For each selected feature map, a small set of convolutional filters predicts, at every spatial location, two things: offsets that adjust a set of default boxes (described next) and per-class confidence scores.
Default Boxes (Anchors): At each location on a selected feature map, SSD associates a set of default boxes with different aspect ratios and scales. These default boxes tile the feature map and serve as initial proposals. The network predicts offsets $(\Delta cx, \Delta cy, \Delta w, \Delta h)$ that adjust the position and size of these default boxes to better fit the ground-truth object, along with confidence scores for each class. The scales of the default boxes are typically smaller for higher-resolution feature maps and larger for lower-resolution maps, aligning box sizes with the expected object sizes at that feature level.
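The sketch below shows how default boxes can be tiled over one feature map and adjusted by predicted offsets, using the standard center-offset parameterization. SSD additionally divides the offsets by fixed "variance" constants, which this sketch omits:

```python
import torch

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Tile default boxes (cx, cy, w, h), normalized to [0, 1], over a
    square feature map. `scale` sets box size relative to the image."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append([cx, cy, scale * ar ** 0.5, scale / ar ** 0.5])
    return torch.tensor(boxes)

def decode(defaults, offsets):
    """Apply predicted offsets (dcx, dcy, dw, dh) to default boxes."""
    centers = defaults[:, :2] + offsets[:, :2] * defaults[:, 2:]  # shift center
    sizes = defaults[:, 2:] * torch.exp(offsets[:, 2:])           # rescale w, h
    return torch.cat([centers, sizes], dim=1)

d = default_boxes(fmap_size=5, scale=0.5)  # coarse 5x5 map, larger boxes
print(d.shape)                              # torch.Size([75, 4])
```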
Training: During training, each ground-truth bounding box is first matched to the default box with the highest Jaccard overlap (Intersection over Union, IoU). Default boxes with an IoU greater than a threshold (e.g., 0.5) with any ground-truth box are also treated as positive matches; all remaining default boxes are marked as negative (background). The loss function is a weighted sum of a localization loss (Smooth L1 between the predicted offsets and the encoded ground-truth box parameters, computed only over positive matches) and a confidence loss (softmax cross-entropy over the class scores). Because negatives vastly outnumber positives, SSD applies hard negative mining, keeping only the highest-loss negatives at roughly a 3:1 negative-to-positive ratio. A sketch of the matching step follows.
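This is a minimal version of the matching logic, assuming boxes in corner format and using torchvision's `box_iou` helper; hard negative mining is left out:

```python
import torch
from torchvision.ops import box_iou  # pairwise IoU (Jaccard overlap)

def match_defaults(defaults, ground_truths, iou_threshold=0.5):
    """Assign each default box a ground-truth index (-1 = background).

    defaults: (N, 4) and ground_truths: (M, 4), both in (x1, y1, x2, y2).
    """
    iou = box_iou(defaults, ground_truths)      # (N, M) overlap matrix
    best_iou, best_gt = iou.max(dim=1)          # best ground truth per default

    matches = torch.full((defaults.size(0),), -1, dtype=torch.long)
    positive = best_iou > iou_threshold
    matches[positive] = best_gt[positive]

    # Ensure every ground truth gets at least one default box:
    # force its highest-overlap default to be positive.
    matches[iou.argmax(dim=0)] = torch.arange(ground_truths.size(0))
    return matches
```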
SSD offers a good balance between speed and accuracy and is often significantly faster than two-stage detectors. However, because small objects must be detected from earlier feature maps that carry weaker semantic information, and because it relies heavily on pre-defined default boxes across multiple scales, SSD sometimes struggles to detect very small objects accurately compared to methods that analyze finer details more thoroughly or include a dedicated proposal stage.
A primary challenge for dense, single-stage detectors like SSD is the extreme class imbalance during training. The vast majority of default boxes or anchor locations correspond to the background class, while only a small fraction represent actual objects. Standard cross-entropy loss applied to all locations means that the easily classified background examples can collectively dominate the loss value and gradient updates, hindering the network's ability to learn effective representations for the rarer foreground object classes.
RetinaNet introduced the Focal Loss specifically to address this imbalance. It's a dynamically scaled cross-entropy loss where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, Focal Loss automatically down-weights the contribution of easy examples (typically abundant background negatives) during training and focuses the model's attention on hard-to-classify examples (often foreground objects or ambiguous background patches).
The standard Cross-Entropy (CE) loss for binary classification can be written as:

$$\text{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$$

where $y \in \{0, 1\}$ is the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class $y = 1$. We can rewrite this more compactly as $\text{CE}(p_t) = -\log(p_t)$, where $p_t$, the probability of the ground-truth class, is defined as:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases}$$
The Focal Loss adds a modulating factor $(1 - p_t)^\gamma$ to the standard cross-entropy loss, with a tunable focusing parameter $\gamma \ge 0$:

$$\text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$

Optionally, an $\alpha$-balancing factor can also be added:

$$\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

where $\alpha_t$ is $\alpha$ for class 1 and $1 - \alpha$ for class 0.
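A direct PyTorch implementation of the $\alpha$-balanced formula looks like this (the defaults $\alpha = 0.25$, $\gamma = 2$ are the values reported to work best in the RetinaNet paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary Focal Loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    # Elementwise cross-entropy is exactly -log(p_t)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))
```

torchvision ships an equivalent implementation as `torchvision.ops.sigmoid_focal_loss`, which is preferable in production code.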
Properties of Focal Loss:

- When an example is misclassified and $p_t$ is small, the modulating factor is near 1 and the loss is nearly unchanged from CE. As $p_t \to 1$, the factor goes to 0, so well-classified examples contribute almost nothing.
- The focusing parameter $\gamma$ smoothly controls how strongly easy examples are down-weighted. Setting $\gamma = 0$ recovers standard CE; the RetinaNet paper found $\gamma = 2$ to work best in practice.
Plotting the loss against $p_t$ illustrates how Focal Loss (FL) with $\gamma > 0$ reduces the loss for well-classified examples (high $p_t$) compared to standard Cross-Entropy (CE, equivalent to $\gamma = 0$). Higher values of $\gamma$ increase this effect, focusing training on harder examples where $p_t$ is low.
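A quick numeric check makes the down-weighting concrete (using $\gamma = 2$ and no $\alpha$-balancing):

```python
import math

# How strongly the (1 - p_t)^gamma factor suppresses easy examples (gamma = 2):
for p_t in (0.1, 0.5, 0.9, 0.99):
    ce = -math.log(p_t)
    fl = (1 - p_t) ** 2 * ce
    print(f"p_t={p_t:.2f}  CE={ce:.3f}  FL={fl:.5f}  reduction={ce / fl:.1f}x")
# The easy example at p_t=0.99 is reduced 10000x,
# while the hard example at p_t=0.1 is barely changed (1.2x).
```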
RetinaNet Architecture: While Focal Loss is the main contribution, the RetinaNet detector itself typically employs a Feature Pyramid Network (FPN) built on top of a backbone like ResNet. FPN generates a multi-scale feature pyramid with rich semantics at all levels, improving the detection of objects across a wide range of scales, complementing the effect of Focal Loss. Anchor boxes are applied to each level of the feature pyramid for prediction.
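For practical experimentation, torchvision packages a ready-made RetinaNet with a ResNet-50 backbone and FPN. A minimal inference sketch (the `weights` argument assumes torchvision 0.13 or newer):

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# Pretrained RetinaNet (ResNet-50 backbone + FPN) as packaged by torchvision.
model = retinanet_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)        # placeholder RGB tensor in [0, 1]
with torch.no_grad():
    output = model([image])[0]         # dict with 'boxes', 'labels', 'scores'

keep = output["scores"] > 0.5          # simple confidence filter
print(output["boxes"][keep].shape, output["labels"][keep])
```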
By effectively managing class imbalance with Focal Loss, RetinaNet demonstrated that single-stage detectors could achieve accuracy comparable to or even exceeding popular two-stage detectors like Faster R-CNN, while maintaining higher speeds.
Both SSD and RetinaNet exemplify the single-stage approach, directly predicting boxes and classes without a dedicated region proposal step.
Choosing between these (and other detectors like YOLO) usually comes down to the application's requirements for speed, accuracy, object size distribution, and scene complexity. RetinaNet generally provides higher accuracy than SSD, especially in challenging scenarios, thanks to Focal Loss and its use of FPN; SSD may be preferred when maximum speed is the priority and its accuracy trade-offs are acceptable.