While semantic segmentation assigns a class label (like 'car', 'person', 'road') to every pixel in an image, instance segmentation takes this a step further. It aims to identify and delineate each distinct object instance within the image. So, instead of labeling all pixels belonging to people as 'person', it would output separate masks for 'person 1', 'person 2', and 'person 3'. This provides a much richer understanding of the scene, separating overlapping objects of the same class.
Instance segmentation inherently combines elements of object detection (locating individual objects with bounding boxes) and semantic segmentation (classifying pixels). One of the most influential and effective approaches for this task is Mask R-CNN.
Mask R-CNN, developed by researchers at Facebook AI Research (FAIR), builds directly upon the Faster R-CNN object detection framework. Recall that Faster R-CNN is a two-stage detector:

1. A Region Proposal Network (RPN) scans the backbone feature map and proposes candidate object regions (RoIs).
2. A second stage pools features for each RoI (originally with RoIPool) and passes them through two parallel heads: one classifies the object and one regresses a refined bounding box.
Mask R-CNN extends this second stage by adding a third parallel branch that predicts a segmentation mask for each RoI, in addition to the existing branches for classification and bounding box regression.
Figure: Overview of Mask R-CNN. A parallel mask prediction branch (yellow) is added to the standard Faster R-CNN heads (blue) that process features from RoIAlign.
The core addition in Mask R-CNN is this mask head. It's typically implemented as a small Fully Convolutional Network (FCN) applied to each RoI.
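As a concrete illustration, here is a minimal PyTorch sketch of such a mask head, loosely following the paper's design of four 3x3 convolutions, a 2x upsampling deconvolution, and a final 1x1 convolution that outputs one mask per class. The channel counts, the 14x14 RoIAlign input size, and the number of classes are illustrative assumptions, not fixed requirements.

```python
import torch
from torch import nn

class MaskHead(nn.Module):
    """Illustrative mask head: four 3x3 convs, a 2x upsampling deconv,
    and a 1x1 conv producing one mask per class for every RoI."""

    def __init__(self, in_channels=256, num_classes=81):  # 81 = COCO-style classes (assumption)
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        self.mask_logits = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, roi_features):
        # roi_features: (num_rois, 256, 14, 14), e.g. pooled by RoIAlign
        x = self.convs(roi_features)
        x = self.relu(self.upsample(x))      # -> (num_rois, 256, 28, 28)
        return self.mask_logits(x)           # -> (num_rois, num_classes, 28, 28)

rois = torch.randn(8, 256, 14, 14)           # 8 pooled RoIs (random stand-ins)
print(MaskHead()(rois).shape)                # torch.Size([8, 81, 28, 28])
```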
A significant innovation introduced with Mask R-CNN is RoIAlign. The original RoIPool operation used in Faster R-CNN involves quantization steps. When mapping an RoI (which has continuous coordinates) onto the discrete grid of a feature map, RoIPool rounds the coordinates. It then divides the RoI into spatial bins and max-pools features within each bin. While this works well enough for classification and bounding box regression, the spatial quantization inaccuracies are detrimental to predicting pixel-accurate masks. A small misalignment caused by rounding can significantly degrade the quality of the segmentation boundary.
RoIAlign avoids this quantization. Instead of rounding the RoI boundaries, it uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations within each RoI bin. These sampled values are then aggregated (usually through max or average pooling) to produce the pooled feature map for that bin. This process preserves precise spatial location information, leading to much better alignment between the extracted features and the original image region, which is vital for high-quality instance segmentation.
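Both operations ship with torchvision, which makes the difference easy to see in code. The sketch below runs roi_align and roi_pool on a toy feature map; the spatial_scale of 1/16 assumes a backbone stride of 16, and sampling_ratio=2 corresponds to the four sampled points per bin described above.

```python
import torch
from torchvision.ops import roi_align, roi_pool

# Toy feature map and a single RoI with fractional image coordinates.
feature_map = torch.randn(1, 256, 50, 68)               # (N, C, H, W)
rois = torch.tensor([[0.0, 103.7, 41.2, 290.1, 188.6]]) # (batch_idx, x1, y1, x2, y2)

# RoIAlign: no rounding; bilinear sampling at sampling_ratio^2 points per bin.
aligned = roi_align(feature_map, rois, output_size=(14, 14),
                    spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)

# RoIPool: quantizes the RoI to the feature grid before max-pooling each bin.
pooled = roi_pool(feature_map, rois, output_size=(14, 14),
                  spatial_scale=1.0 / 16)

print(aligned.shape, pooled.shape)   # both torch.Size([1, 256, 14, 14])
```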
Mask R-CNN is trained end-to-end using a multi-task loss function. The total loss $L$ on each sampled RoI is the sum of the classification loss $L_{\text{cls}}$, the bounding box regression loss $L_{\text{box}}$, and the mask segmentation loss $L_{\text{mask}}$:

$$L = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}}$$

This multi-task training allows the network to simultaneously learn to classify objects, refine their bounding boxes, and generate detailed segmentation masks, all leveraging shared convolutional features from the backbone network.
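The sketch below shows how these three terms are typically combined, using random tensors as stand-ins for the head outputs and matched ground-truth targets. It is a simplification (for example, it ignores that the box and mask losses are only computed on positive RoIs), but it illustrates that the mask term is a per-pixel binary cross-entropy applied to the mask predicted for the ground-truth class, so classes do not compete for mask pixels.

```python
import torch
import torch.nn.functional as F

num_rois, num_classes, mask_size = 16, 81, 28

# Random stand-ins for head outputs (require grad) and their matched targets.
cls_logits   = torch.randn(num_rois, num_classes, requires_grad=True)
box_deltas   = torch.randn(num_rois, 4, requires_grad=True)
mask_logits  = torch.randn(num_rois, mask_size, mask_size, requires_grad=True)

cls_targets  = torch.randint(0, num_classes, (num_rois,))
box_targets  = torch.randn(num_rois, 4)
mask_targets = torch.randint(0, 2, (num_rois, mask_size, mask_size)).float()

loss_cls  = F.cross_entropy(cls_logits, cls_targets)
loss_box  = F.smooth_l1_loss(box_deltas, box_targets)
# Mask loss: average binary cross-entropy over the ground-truth class's mask.
loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)

loss = loss_cls + loss_box + loss_mask
loss.backward()   # in a real model, gradients flow into all heads and the shared backbone
```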
During inference, Mask R-CNN follows the Faster R-CNN procedure to generate RoIs via the RPN. For each proposed RoI, it predicts a class label, a bounding box refinement, and a pixel-wise mask. A confidence score threshold is applied to filter detections. Non-maximum suppression (NMS) is performed based on the bounding boxes to remove duplicate detections. For the surviving detections, the corresponding class-specific mask predicted by the mask head is selected, resized to the final bounding box dimensions, and typically thresholded at 0.5 to yield the final binary instance mask.
The output is a set of objects, each with a class label, a confidence score, a bounding box, and a precise pixel-level segmentation mask identifying that specific instance.
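To see this output format without implementing the pipeline yourself, torchvision ships a pre-trained Mask R-CNN whose inference interface matches the description above. The sketch below assumes torchvision 0.13 or newer for the weights argument; the 0.5 score and mask thresholds are common defaults rather than requirements.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# COCO-pretrained Mask R-CNN from torchvision, switched to inference mode.
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)          # stand-in for an RGB image scaled to [0, 1]
with torch.no_grad():
    output = model([image])[0]           # dict with 'boxes', 'labels', 'scores', 'masks'

keep   = output["scores"] > 0.5          # confidence filtering (NMS is applied internally)
boxes  = output["boxes"][keep]           # (K, 4) refined boxes
labels = output["labels"][keep]          # (K,) class ids
masks  = output["masks"][keep] > 0.5     # (K, 1, H, W) soft masks -> binary instance masks

print(f"{keep.sum().item()} instances kept")
```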
Mask R-CNN demonstrated state-of-the-art performance on instance segmentation benchmarks upon its release and remains a strong and widely used baseline. Its design elegantly combines object detection and segmentation into a single, trainable framework, paving the way for many subsequent developments in instance-level scene understanding. While other approaches exist, including single-stage methods designed for speed (like YOLACT or SOLO), understanding Mask R-CNN provides a solid foundation for tackling instance segmentation problems.