Recall from the chapter introduction that two-stage object detectors, like the R-CNN family, first propose candidate regions in an image that might contain objects and then classify these regions. Early methods like R-CNN and Fast R-CNN relied on external algorithms, such as Selective Search, to generate these region proposals. While effective, this external step was often computationally expensive and became a bottleneck, preventing true end-to-end training and inference.
Faster R-CNN introduced a significant innovation to address this bottleneck: the Region Proposal Network (RPN). The RPN is a fully convolutional network designed to efficiently predict object proposals directly from the convolutional feature maps generated by the main backbone network (like a VGG or ResNet). This integration allows the proposal generation step to share convolutional features with the downstream detection network, making it much faster.
Imagine the deep convolutional feature map produced by the backbone network after processing an input image. This feature map retains spatial information, although at a reduced resolution compared to the original image. The RPN operates by sliding a small convolutional network (often using a 3x3 kernel) over this feature map.
At each sliding window location on the feature map, the RPN performs two tasks simultaneously:

- Objectness classification: estimating the probability that an object of any class is present at that location.
- Bounding box regression: predicting coordinate adjustments that refine a set of reference boxes toward the actual object.
Instead of predicting bounding boxes from scratch, which is a complex task, the RPN uses a set of predefined reference boxes called anchor boxes (or simply anchors). At each location where the RPN's sliding window operates, it considers multiple anchors, typically varying in scale (size) and aspect ratio (width-to-height ratio). For example, a common configuration uses 9 anchors per location: 3 scales combined with 3 aspect ratios (e.g., 1:1, 1:2, 2:1).
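The anchor set can be built once and reused at every location. The sketch below generates the common 9-anchor configuration; the base size, scales, and ratios are the illustrative defaults from the Faster R-CNN paper, not the only valid choice.

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors centered at (0, 0).

    Each anchor is (x1, y1, x2, y2). For a given scale, all ratios share
    the same area; the ratio only reshapes the box.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            w = np.sqrt(area / ratio)  # width shrinks as the ratio grows
            h = w * ratio              # keeps w * h == area
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9, 4): 3 scales x 3 aspect ratios
```

At inference time these centered anchors are shifted to every feature-map position (one stride apart), giving k anchors per location across the whole image.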
These anchors serve as initial guesses or priors for potential object locations and shapes. The RPN's job is not to invent boxes but to classify each predefined anchor as "object" or "background" and to slightly adjust the coordinates of promising anchors.
Figure: Flow within Faster R-CNN. The RPN takes features from the backbone, applies anchors, and generates proposals via its classification and regression layers. These proposals, together with the backbone features, feed into the final detection head.
For each of the k anchors at every sliding window position, the RPN has two sibling output layers:

- A classification layer that outputs 2k scores, estimating the probability of "object" versus "background" for each anchor.
- A regression layer that outputs 4k values, the coordinate offsets that refine each anchor toward a nearby object.
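The two sibling layers are 1x1 convolutions, which reduce to a matrix multiply over the channel axis. The sketch below only demonstrates the output shapes; the channel counts (k = 9 anchors, 512 intermediate channels) follow the paper's VGG configuration, while the spatial size and weights are arbitrary.

```python
import numpy as np

# Illustrative shapes: k anchors, C channels from the 3x3 conv, HxW feature map.
k, C, H, W = 9, 512, 38, 50
features = np.random.randn(C, H, W).astype(np.float32)

# A 1x1 convolution is a per-pixel matrix multiply over channels.
w_cls = np.random.randn(2 * k, C).astype(np.float32) * 0.01  # 2 scores per anchor
w_reg = np.random.randn(4 * k, C).astype(np.float32) * 0.01  # 4 offsets per anchor

cls_scores = np.einsum('oc,chw->ohw', w_cls, features)   # (2k, H, W)
bbox_deltas = np.einsum('oc,chw->ohw', w_reg, features)  # (4k, H, W)

print(cls_scores.shape, bbox_deltas.shape)  # (18, 38, 50) (36, 38, 50)
```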
During training, anchors are labeled based on their Intersection over Union (IoU) with ground-truth object boxes. An anchor might be labeled as:

- Positive: its IoU with some ground-truth box exceeds a high threshold (0.7 in the original paper), or it has the highest IoU with a given ground-truth box.
- Negative: its IoU with every ground-truth box falls below a low threshold (0.3 in the paper); these anchors are treated as background.
- Ignored: anchors between the two thresholds contribute nothing to the training loss.
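The labeling rule above can be sketched directly. The thresholds match the paper's defaults; note that the "highest-IoU" positive rule (which rescues ground-truth boxes no anchor overlaps well) is noted in a comment but not implemented in this simplified version.

```python
def iou(box, gt):
    """IoU between one anchor and one ground-truth box, both (x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def label_anchor(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label an anchor from its best IoU over all ground-truth boxes.

    Simplified: the full rule also marks the highest-IoU anchor for each
    ground-truth box as positive, even below pos_thresh.
    """
    if max_iou >= pos_thresh:
        return 1    # positive: object
    if max_iou < neg_thresh:
        return 0    # negative: background
    return -1       # ignored during training

print(label_anchor(iou([0, 0, 10, 10], [0, 0, 10, 10])))    # 1 (perfect overlap)
print(label_anchor(iou([0, 0, 10, 10], [20, 20, 30, 30])))  # 0 (no overlap)
```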
The RPN is then trained end-to-end using a multi-task loss function, which combines:

- A classification loss (log loss over the "object" versus "background" labels).
- A regression loss (smooth L1 on the predicted box offsets), computed only for positive anchors and weighted by a balancing term.
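A minimal sketch of this multi-task loss follows. The normalization is simplified (the paper normalizes the two terms separately by mini-batch size and number of anchor locations); the label convention (1 positive, 0 negative, -1 ignored) is an assumption of this sketch.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber) loss, elementwise: quadratic near 0, linear beyond beta."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def rpn_loss(cls_probs, labels, pred_deltas, target_deltas, lam=1.0):
    """Simplified RPN multi-task loss.

    labels: 1 = positive, 0 = negative, -1 = ignored.
    The regression term counts only positive anchors, as in the paper.
    """
    keep = labels >= 0
    p = np.clip(cls_probs[keep], 1e-7, 1 - 1e-7)
    y = labels[keep]
    l_cls = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    pos = labels == 1
    if pos.any():
        l_reg = smooth_l1(pred_deltas[pos] - target_deltas[pos]).sum(axis=-1).mean()
    else:
        l_reg = 0.0
    return l_cls + lam * l_reg
```

With perfect predictions (high scores on positives, low on negatives, exact offsets) this loss approaches zero, which is a quick sanity check when wiring up training code.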
After the RPN processes the feature map, we obtain a large number of potential object proposals, each with an objectness score and refined coordinates. Many of these proposals will be highly overlapping and redundant. To prune this set, two steps are typically applied:

- Score filtering: proposals are ranked by objectness score and only the top-N are retained (e.g., around 2,000 before NMS during training).
- Non-maximum suppression (NMS): among heavily overlapping proposals (e.g., IoU above 0.7), only the highest-scoring one is kept.
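The pruning steps above can be sketched with a greedy NMS implementation in NumPy; the IoU threshold and the example boxes are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep

boxes = np.array([[0, 0, 10, 10], [0.5, 0.5, 10.5, 10.5], [20, 20, 30, 30]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed by box 0
```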
The remaining proposals (typically a few hundred or thousand per image, e.g., ~300 for Faster R-CNN post-NMS) are then passed, along with the original feature maps, to the second stage of the Faster R-CNN detector (often referred to as the RoI Head or Fast R-CNN detector). This second stage performs RoIPooling (or RoIAlign) on the features corresponding to each proposal and then uses final classification and regression layers to assign a specific class label (e.g., "car," "person," "dog") and further refine the bounding box coordinates.
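To make the RoIPooling step concrete, here is a naive single-RoI sketch: project the proposal onto the feature map, split it into a fixed grid, and max-pool each cell. The stride, output size, and shapes are illustrative; real implementations batch this and RoIAlign replaces the integer quantization with bilinear sampling.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7, stride=16):
    """Naive RoIPooling: max-pool one proposal's region into a fixed grid.

    feature_map: (C, H, W); roi: (x1, y1, x2, y2) in image coordinates.
    """
    x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]  # project to feature coords
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)               # avoid empty regions
    region = feature_map[:, y1:y2, x1:x2]
    C, h, w = region.shape
    out = np.zeros((C, output_size, output_size), dtype=feature_map.dtype)
    ys = np.linspace(0, h, output_size + 1).astype(int)  # bin edges
    xs = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

fm = np.random.randn(256, 38, 50).astype(np.float32)
pooled = roi_pool(fm, (32, 48, 320, 240))
print(pooled.shape)  # (256, 7, 7): fixed size regardless of the proposal's shape
```

The fixed output size is the point: proposals of any shape become identically sized tensors, so the same fully connected classification and regression head can process all of them.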
The core advantage of the RPN lies in its computational efficiency. By sharing the expensive convolutional feature extraction layers with the final detection network, the cost of generating region proposals becomes almost negligible compared to the older methods relying on separate algorithms like Selective Search. This integration allows Faster R-CNN to achieve significantly higher speeds, enabling near real-time object detection while maintaining high accuracy, and facilitates true end-to-end optimization of the entire detection pipeline. Understanding RPNs is fundamental to grasping the architecture of many modern and effective two-stage object detectors.
© 2025 ApX Machine Learning