As introduced earlier, object detection requires not just classifying an image but locating and identifying potentially multiple objects within it. One major family of approaches tackles this in two distinct steps: first, propose candidate regions in the image that might contain an object, and second, classify the object (if any) within each proposed region and refine its bounding box. These are known as two-stage detectors. The R-CNN family represents the pioneering and highly influential lineage of this approach.
R-CNN: Regions with CNN Features
The original R-CNN (Regions with Convolutional Neural Network features) paper marked a significant improvement in object detection accuracy by successfully applying deep CNNs, which were already excelling at image classification. However, adapting CNN classifiers, which typically operate on fixed-size inputs, to detect arbitrarily sized objects in various locations was non-trivial. R-CNN proposed a multi-step pipeline:
- Region Proposal: Instead of analyzing every possible window, R-CNN first generated a manageable set of candidate object regions. It used an external algorithm, typically Selective Search, which produces around 2000 class-agnostic region proposals (bounding boxes) per image by hierarchically grouping pixels according to low-level similarity cues such as color, texture, size, and fill.
- Feature Extraction: Each proposed region was warped (anisotropically scaled) to the fixed input size required by a pre-trained CNN (like AlexNet). The warped region was then fed through the CNN to extract a fixed-length feature vector (e.g., from the layer before the final classifier). This step was the computational bottleneck. Since proposals often overlap substantially, the same image pixels were processed multiple times through the expensive CNN.
- Classification: For each class (plus a background category), a separate linear Support Vector Machine (SVM) classifier was trained using the extracted CNN features as input. Given the features for a region proposal, the SVMs would predict the class of the object within that region.
- Bounding Box Regression: To improve localization accuracy, R-CNN also trained class-specific linear regression models. These models learned to predict offsets to refine the coordinates of a proposed bounding box, given the CNN features for that region.
R-CNN pipeline overview. Note the separate stages and the repeated CNN feature extraction for each warped region.
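The bounding-box regression step uses a scale-invariant parameterization: center offsets are normalized by the proposal's size, and width/height corrections live in log space. A minimal numpy sketch of this encoding (the box values below are hypothetical, and boxes are in center-size form for simplicity):

```python
import numpy as np

def encode(proposal, gt):
    """Encode a ground-truth box relative to a proposal as (tx, ty, tw, th).

    Boxes are (cx, cy, w, h). Center offsets are normalized by the
    proposal's width/height; size adjustments are in log space, so the
    targets are invariant to the proposal's absolute scale.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def decode(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return np.array([px + tx * pw, py + ty * ph,
                     pw * np.exp(tw), ph * np.exp(th)])

proposal = np.array([50.0, 50.0, 40.0, 40.0])   # hypothetical proposal
gt       = np.array([55.0, 48.0, 50.0, 30.0])   # hypothetical ground truth
t = encode(proposal, gt)
assert np.allclose(decode(proposal, t), gt)     # encode/decode round-trips
```

At test time, the regressor predicts `t` from the region's CNN features and `decode` produces the refined box.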
While R-CNN significantly advanced the state of the art, its pipeline had major drawbacks:
- Training: It involved multiple independent stages: fine-tuning the CNN on warped regions, training one SVM per class, and training bounding box regressors. This pipeline was complex, and the stages could not be optimized jointly.
- Speed: Inference was extremely slow (around 40-50 seconds per image on GPUs of the time) primarily because the CNN had to be run independently on ~2000 warped region proposals per image, leading to massive redundant computations.
- Storage: Extracted features for all proposals needed to be stored, requiring significant disk space.
Fast R-CNN: Sharing Computation
Addressing the speed bottleneck of R-CNN was the primary motivation for Fast R-CNN. The key insight was that the redundant CNN computations on overlapping regions could be avoided by sharing computation.
Instead of running the CNN on each warped region proposal individually, Fast R-CNN works as follows:
- Full Image Feature Map: The entire input image is passed through the backbone CNN once to generate a convolutional feature map.
- Project Proposals: The region proposals (still generated externally, e.g., by Selective Search) are projected onto this shared convolutional feature map.
- RoI Pooling: A novel layer called Region of Interest (RoI) Pooling was introduced. For each projected region proposal (which now corresponds to a rectangular area on the feature map), RoI Pooling extracts a small, fixed-size feature map (e.g., 7x7). It does this by dividing the proposal region on the feature map into a grid of fixed size (e.g., 7x7 sub-windows) and max-pooling the features within each sub-window. This elegantly handles the variable sizes of region proposals while producing a fixed-size output suitable for subsequent fully connected layers.
- Unified Head: The fixed-size feature map from RoI pooling is fed into a sequence of fully connected layers. Finally, this branches into two sibling output layers:
- A softmax layer that outputs class probabilities (over K object classes + 1 background class).
- A bounding box regression layer that outputs refined box coordinates (typically 4 values per object class).
Fast R-CNN architecture. Feature extraction happens once. RoI Pooling bridges the gap between variable-sized proposals and fixed-size inputs for the classification/regression head.
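The grid-and-max-pool procedure described above can be sketched directly in numpy. This is a simplified version (real implementations handle rounding and bin boundaries more carefully); the feature-map size and RoI coordinates are hypothetical:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """Max-pool a rectangular region of a feature map into a fixed grid.

    feature_map: (C, H, W) array. roi: (x1, y1, x2, y2) in feature-map
    coordinates (i.e. already projected from image coordinates by
    dividing by the backbone's total stride).
    """
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    C, H, W = region.shape
    # Bin boundaries partitioning the region into output_size pieces.
    ys = np.linspace(0, H, output_size + 1).astype(int)
    xs = np.linspace(0, W, output_size + 1).astype(int)
    out = np.zeros((C, output_size, output_size), feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            y0, y1b = ys[i], max(ys[i + 1], ys[i] + 1)  # keep bins non-empty
            x0, x1b = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, y0:y1b, x0:x1b].max(axis=(1, 2))
    return out

fmap = np.random.rand(256, 32, 48)          # hypothetical backbone output
pooled = roi_pool(fmap, (10, 5, 30, 20))    # a variable-size RoI
assert pooled.shape == (256, 7, 7)          # fixed-size output, regardless of RoI
```

Whatever the proposal's size, the output is always `output_size × output_size` per channel, which is exactly what the fully connected head requires.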
Fast R-CNN offered substantial advantages over R-CNN:
- Speed: It was significantly faster during both training and inference (around 9x faster training, 200x faster inference) because the bulk of the computation (the CNN backbone) is shared across all proposals.
- End-to-End Training (mostly): The network, including the classification and bounding box regression layers, could be trained jointly in a single stage using a multi-task loss (combining classification loss and regression loss), simplifying the training process compared to R-CNN's multi-stage approach.
- Accuracy: The joint training generally led to improved accuracy.
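The multi-task loss mentioned above combines a softmax log loss over the K+1 classes with a smooth L1 loss on the box offsets, with the localization term counted only for non-background RoIs. A sketch for a single RoI (the logits and offsets below are hypothetical):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used for box regression in Fast R-CNN: quadratic
    for small errors, linear for large ones (robust to outliers)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(class_logits, true_class, pred_offsets, true_offsets, lam=1.0):
    """L = L_cls + lam * [true_class != background] * L_loc, one RoI."""
    # Classification: cross-entropy over K+1 classes (index 0 = background).
    logits = class_logits - class_logits.max()        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    l_cls = -log_probs[true_class]
    # Localization: only counted for foreground RoIs.
    l_loc = smooth_l1(pred_offsets - true_offsets).sum()
    return l_cls + lam * (true_class != 0) * l_loc

logits = np.array([0.1, 2.0, -1.0])   # hypothetical scores: background + 2 classes
offsets_pred = np.zeros(4)
offsets_true = np.array([0.1, 0.0, 0.2, 0.0])
loss_fg = multitask_loss(logits, 1, offsets_pred, offsets_true)  # foreground RoI
loss_bg = multitask_loss(logits, 0, offsets_pred, offsets_true)  # background RoI
```

Because both terms are differentiable, gradients flow through the shared backbone from classification and localization simultaneously, which is what enables the single-stage training.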
However, Fast R-CNN still relied on an external, often slow, region proposal method like Selective Search, which became the new computational bottleneck during inference.
Faster R-CNN: Towards End-to-End Detection
Faster R-CNN addressed the final bottleneck of Fast R-CNN by integrating the region proposal mechanism into the deep network itself. This was achieved by introducing the Region Proposal Network (RPN).
The RPN is a small, fully convolutional network that takes the convolutional feature map (produced by the shared backbone CNN) as input and outputs a set of rectangular object proposals, each with an associated "objectness" score (probability of containing any object vs. background).
Here's how Faster R-CNN works:
- Shared Backbone: As in Fast R-CNN, the input image is processed by a backbone CNN (e.g., VGG, ResNet) to produce a deep convolutional feature map.
- Region Proposal Network (RPN):
- This network slides a small n×n spatial window (e.g., 3×3) over the shared convolutional feature map.
- At each sliding-window location, it considers multiple potential proposals simultaneously. These are generated relative to predefined anchor boxes (or simply "anchors"). Anchors are reference boxes centered at the sliding window position, typically having multiple scales and aspect ratios (e.g., 3 scales x 3 aspect ratios = 9 anchors per location).
- For each anchor, the RPN outputs two predictions via sibling fully connected layers (implemented as 1×1 convolutions):
- Objectness Score: two scores estimating the probability that the anchor contains an object versus background.
- Box Refinements: 4 values representing parameterized coordinate adjustments (offsets for center x,y and log-space adjustments for width, height) to make the anchor fit a potential object better.
- After generating proposals across all locations, Non-Maximum Suppression (NMS) is applied based on the objectness scores to reduce redundancy.
- RoI Pooling & Final Head: The high-scoring region proposals generated by the RPN are then used, just like the external proposals in Fast R-CNN. They are projected onto the same shared feature map, RoI Pooling extracts fixed-size features for each proposal, and these are fed into the final classification and bounding box regression layers to predict the specific object class and refine the box coordinates further.
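The anchor set from step 2 of the RPN can be sketched as follows. This generates the (width, height) of the reference boxes for one sliding-window location, roughly following the Faster R-CNN defaults (a stride-16 backbone with scales {8, 16, 32} and aspect ratios {0.5, 1, 2}, i.e. anchor areas of 128², 256², and 512² pixels); the feature-map size is a hypothetical example:

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return the (w, h) of the reference anchors for one sliding-window
    location: each scale fixes an area, and each aspect ratio reshapes
    the box while preserving that area."""
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2
        for ratio in ratios:
            w = np.sqrt(area / ratio)   # solve w * h = area with h = ratio * w
            h = ratio * w
            anchors.append((w, h))
    return np.array(anchors)

anchors = make_anchors()
assert anchors.shape == (9, 2)       # 3 scales x 3 aspect ratios
fh, fw = 38, 50                      # hypothetical conv feature-map size
total = fh * fw * len(anchors)       # one anchor set per spatial location
```

The same 9 anchors are replicated (shifted by the stride) at every feature-map position, so a single image yields tens of thousands of candidate anchors before scoring and NMS.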
Faster R-CNN architecture. The RPN generates proposals internally using the shared feature map, making the system nearly end-to-end and eliminating the Selective Search bottleneck.
The introduction of the RPN was significant because:
- Efficiency: It shares the expensive convolutional features with the downstream detection network, making region proposal generation almost computationally free.
- End-to-End Trainability: The entire system (backbone, RPN, detection head) can be trained jointly, although the original paper described a 4-step alternating training scheme to manage the dependencies between RPN training and Fast R-CNN detector training. Modern implementations often use approximate joint training.
- Improved Proposals: The RPN learns to generate proposals specifically tailored for the detection network, potentially leading to better quality proposals than fixed algorithms like Selective Search.
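The NMS step used to prune the RPN's scored proposals (step 3 above) is a simple greedy procedure: keep the highest-scoring box, discard overlapping boxes, and repeat. A minimal numpy sketch with hypothetical corner-format boxes:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box and
    drop boxes that overlap it by more than iou_threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10],
                  [0.5, 0.5, 10.5, 10.5],   # near-duplicate of box 0
                  [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
assert nms(boxes, scores) == [0, 2]         # the duplicate is suppressed
```

In the full pipeline this same routine is applied twice: once to the RPN's objectness-scored proposals, and again per class to the detection head's final outputs.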
Faster R-CNN became a foundational architecture for many subsequent object detection models. While often surpassed in speed by single-stage detectors (which we discuss next), the two-stage approach, particularly the Faster R-CNN framework, often maintains an edge in localization accuracy for complex scenes. Understanding this R-CNN lineage provides essential context for appreciating the design choices and trade-offs in modern object detection systems.