Detecting objects requires not only classifying image regions but also precisely localizing them with bounding boxes. Early methods often relied on computationally expensive sliding window approaches. Modern detectors, particularly single-stage ones like YOLO and SSD, and the proposal stage of two-stage detectors like Faster R-CNN, employ a more efficient concept: anchor boxes (also sometimes called default boxes or priors). These predefined boxes provide reference templates that the network learns to refine.
Think of anchor boxes as a set of initial guesses for bounding boxes, strategically placed across the image at different locations and with varying sizes and shapes. Instead of predicting box coordinates from scratch, the network learns to predict, for each anchor, small offsets that refine the anchor's position and size to fit a nearby object, along with class scores indicating what, if anything, the anchor contains.
These anchor boxes are typically defined relative to the cells of the output feature map of a convolutional backbone. For a feature map of size W×H with C channels, each of the W×H spatial locations (cells) is associated with a set of k anchor boxes. Each anchor box has a predefined scale (size) and aspect ratio (width-to-height ratio).
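To make the layout concrete, here is a minimal NumPy sketch that generates k anchors per feature-map cell. The function name, the choice of center-size coordinates, and the stride-based projection into image coordinates are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def generate_anchors(feature_size, stride, scales, aspect_ratios):
    """Generate (x_center, y_center, w, h) anchors for every cell of an
    H x W feature map, projected into input-image coordinates.

    `stride` is the backbone's downsampling factor (e.g. 16), so cell
    (i, j) maps to image location ((j + 0.5)*stride, (i + 0.5)*stride).
    """
    H, W = feature_size
    anchors = []
    for i in range(H):
        for j in range(W):
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for s in scales:
                for r in aspect_ratios:
                    # For aspect ratio r = w/h at scale s (area s^2):
                    # w = s*sqrt(r), h = s/sqrt(r)
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    anchors.append([cx, cy, w, h])
    return np.array(anchors)

# 2x2 feature map, stride 16, two scales x two ratios -> k = 4 per cell
boxes = generate_anchors((2, 2), 16, scales=[32, 64], aspect_ratios=[0.5, 1.0])
print(boxes.shape)  # (16, 4): 2*2 cells, 4 anchors each
```

Note that the total anchor count grows as W × H × k, which is why anchor design trades coverage against computation.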
Figure: Anchor boxes with different scales and aspect ratios associated with a single spatial location on a feature map, projected onto the corresponding region in the input image.
The effectiveness of an anchor-based detector heavily depends on the appropriate selection of anchor box scales and aspect ratios. These choices directly influence the model's ability to detect objects of various sizes and shapes. Common strategies include hand-picking scales and ratios based on the statistics of the target dataset, and clustering the dimensions of the ground truth boxes (for example with k-means, as popularized by YOLOv2) so that the anchors match the object shapes that actually occur.
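One common strategy, popularized by YOLOv2, clusters the ground-truth box dimensions with k-means using 1 − IoU as the distance metric. The following is a minimal sketch under that assumption; the function names and the empty-cluster handling are illustrative:

```python
import numpy as np

def iou_wh(wh, centroids):
    # IoU between boxes that share the same center, given only (w, h)
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    area = wh[:, 0] * wh[:, 1]
    carea = centroids[:, 0] * centroids[:, 1]
    return inter / (area[:, None] + carea[None, :] - inter)

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchor shapes,
    using distance = 1 - IoU (so we maximize IoU to assign)."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centroids), axis=1)
        # Keep a centroid unchanged if its cluster is empty
        new = np.array([wh[assign == c].mean(axis=0) if np.any(assign == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

With this distance, a small box and a large box of the same aspect ratio are still far apart, which matches what anchors need to capture.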
The number of anchors (k) per location is also a design choice. Using more anchors increases coverage and potentially recall, especially for unusually shaped objects, but also increases computational cost and the number of predictions the network must make.
During training, each anchor box needs to be labeled as either containing an object (positive) or background (negative). This assignment is typically done using IoU between the anchor boxes and the ground truth bounding boxes.
A common matching strategy, following Faster R-CNN, labels an anchor positive if its IoU with any ground truth box exceeds an upper threshold (e.g., 0.7), labels it negative if its highest IoU falls below a lower threshold (e.g., 0.3), and ignores anchors in between during training. Additionally, for each ground truth box, the anchor with the highest IoU is labeled positive regardless of threshold, so every object is matched to at least one anchor.
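This assignment can be sketched with NumPy as follows. The thresholds follow the Faster R-CNN convention mentioned above; the label encoding (gt index for positives, -1 for background, -2 for ignored) is an illustrative choice:

```python
import numpy as np

def iou_matrix(anchors, gt):
    """Pairwise IoU between anchors and ground-truth boxes,
    both given as (x1, y1, x2, y2) corner coordinates."""
    x1 = np.maximum(anchors[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def match_anchors(anchors, gt, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor: matched gt index (positive), -1 (negative),
    or -2 (ignored during training)."""
    iou = iou_matrix(anchors, gt)
    best_gt = iou.argmax(axis=1)
    best_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -2)            # ignore by default
    labels[best_iou < neg_thresh] = -1            # background
    pos = best_iou >= pos_thresh
    labels[pos] = best_gt[pos]                    # positive matches
    # Every gt box gets at least one positive anchor
    labels[iou.argmax(axis=0)] = np.arange(len(gt))
    return labels
```

In practice most anchors are negatives, so detectors also subsample or reweight them to keep the loss balanced.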
For anchors matched positively to a ground truth box, the network learns to predict refinement offsets. Instead of predicting absolute coordinates (x,y,w,h), the network predicts four delta values (tx,ty,tw,th) relative to the anchor box's properties (ax,ay,aw,ah).
A widely used parameterization (similar to Faster R-CNN) is:
t_x = (x_gt − a_x) / a_w
t_y = (y_gt − a_y) / a_h
t_w = log(w_gt / a_w)
t_h = log(h_gt / a_h)

Here, (x_gt, y_gt, w_gt, h_gt) are the center coordinates, width, and height of the ground truth box, and (a_x, a_y, a_w, a_h) are the corresponding properties of the anchor box. The network is trained using a regression loss (like Smooth L1 loss) to minimize the difference between its predicted t values and the target t values calculated from the matched ground truth box.
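The target computation above translates directly into code. This is a minimal sketch with an illustrative function name, operating on single boxes in center-size form:

```python
import numpy as np

def encode_deltas(gt, anchor):
    """Regression targets (tx, ty, tw, th) for a ground-truth box
    matched to an anchor; both given as (x_center, y_center, w, h)."""
    xg, yg, wg, hg = gt
    xa, ya, wa, ha = anchor
    return np.array([(xg - xa) / wa,      # tx: center shift in anchor widths
                     (yg - ya) / ha,      # ty: center shift in anchor heights
                     np.log(wg / wa),     # tw: log scale change in width
                     np.log(hg / ha)])    # th: log scale change in height

t = encode_deltas(gt=(52.0, 48.0, 40.0, 20.0), anchor=(50.0, 50.0, 20.0, 20.0))
# tx = 0.1, ty = -0.1, tw = log(2), th = 0
```

Dividing the center shifts by the anchor's size and taking logs of the size ratios keeps the targets in a similar numeric range regardless of object scale, which makes regression easier to learn.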
At inference time, the network predicts (tx,ty,tw,th). These predictions are then used to transform the initial anchor box (ax,ay,aw,ah) into the final predicted bounding box (px,py,pw,ph):
p_x = t_x · a_w + a_x
p_y = t_y · a_h + a_y
p_w = exp(t_w) · a_w
p_h = exp(t_h) · a_h

This regression mechanism allows the network to precisely adjust the size and position of the predefined anchors to match the detected objects accurately. The final set of refined boxes, along with their confidence and class scores, are then typically processed using Non-Maximum Suppression (NMS) to eliminate redundant detections, which we discuss in the next section.
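The decoding step is the exact inverse of the training-time encoding. A minimal sketch, with an illustrative function name and single-box inputs in center-size form:

```python
import numpy as np

def decode_deltas(t, anchor):
    """Apply predicted deltas (tx, ty, tw, th) to an anchor
    (x_center, y_center, w, h) to produce the final box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa,        # px: shift center by tx anchor widths
                     ty * ha + ya,        # py: shift center by ty anchor heights
                     np.exp(tw) * wa,     # pw: rescale width
                     np.exp(th) * ha])    # ph: rescale height

box = decode_deltas(t=(0.1, -0.1, np.log(2.0), 0.0),
                    anchor=(50.0, 50.0, 20.0, 20.0))
# box = (52, 48, 40, 20)
```

The exp on the size terms guarantees positive widths and heights, and decoding the encoded targets of a ground-truth box recovers that box exactly.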
Choosing the right anchor configuration involves trade-offs between detection performance across different object scales/ratios and computational efficiency. While anchor boxes have been foundational to many successful object detectors, ongoing research also explores anchor-free methods that aim to predict object locations more directly, removing the need for predefined anchor sets and their associated hyperparameters. However, understanding anchor boxes remains important for working with many state-of-the-art detection models.
© 2025 ApX Machine Learning