Building on dilated (or atrous) convolutions, which expand the receptive field of filters without adding parameters or significantly reducing spatial resolution, we now examine how to effectively capture contextual information at multiple scales. A single dilated convolution enlarges the receptive field, but objects in an image appear at various sizes, so the model must understand context across different spatial extents simultaneously. Applying dilated convolutions at one fixed rate is often insufficient for complex scenes.
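As a quick check on how dilation expands coverage: a k×k kernel with dilation rate r spans an effective extent of k + (k − 1)(r − 1) pixels per axis, while the parameter count stays at k×k. A small sketch:

```python
def effective_kernel_size(k: int, rate: int) -> int:
    """Spatial extent (per axis) covered by a k x k kernel at the given dilation rate."""
    return k + (k - 1) * (rate - 1)

# A 3x3 kernel covers 3 pixels per axis at rate 1, but 13, 25, and 37 pixels
# at the rates commonly used in ASPP (6, 12, 18) -- same 9 weights each time.
for rate in (1, 6, 12, 18):
    print(rate, effective_kernel_size(3, rate))
```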
The DeepLab family of models, developed by researchers at Google, represents a series of influential architectures specifically designed to tackle this multi-scale challenge in semantic segmentation. A central innovation introduced and refined across DeepLab versions (v1, v2, v3, v3+) is Atrous Spatial Pyramid Pooling (ASPP).
Atrous Spatial Pyramid Pooling (ASPP)
ASPP addresses the multi-scale problem by probing an incoming convolutional feature layer with multiple filters operating in parallel at different dilation rates. This effectively captures image context at several different scales simultaneously.
The core idea involves:
- Parallel Atrous Convolutions: Applying several parallel atrous convolutional layers with different dilation rates (e.g., rates of 6, 12, 18) to the same input feature map. Each rate captures information from a different-sized region around each feature point.
- 1x1 Convolution: Including a standard 1x1 convolution in parallel. This helps capture fine-grained information at the original scale.
- Image-Level Features (Global Context): Incorporating image-level context is often achieved through a global average pooling branch. The input feature map is pooled into a single feature vector, passed through a 1x1 convolution (often with Batch Normalization and ReLU activation), and then bilinearly upsampled back to the spatial dimensions of the input feature map. This branch provides global summary information.
- Concatenation and Fusion: The feature maps resulting from all parallel branches (atrous convolutions, 1x1 convolution, image pooling) are concatenated along the channel dimension.
- Final Processing: This combined feature map is typically passed through a final 1x1 convolution (again, often with Batch Normalization and ReLU) to fuse the multi-scale information and reduce the channel dimension, producing the final feature representation before the segmentation prediction layer.
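The steps above can be sketched as a small PyTorch module. This is a minimal illustration of the pattern (parallel atrous branches, a 1x1 branch, image-level pooling, concatenation, and a fusing 1x1 convolution), not the exact configuration from any DeepLab paper; the channel counts are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal ASPP sketch: parallel atrous branches plus image-level pooling."""
    def __init__(self, in_ch: int, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(k, rate=1):
            # padding = rate keeps the spatial size unchanged for 3x3 kernels
            pad = 0 if k == 1 else rate
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=rate, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # One 1x1 branch plus one 3x3 atrous branch per dilation rate
        self.branches = nn.ModuleList(
            [conv_bn_relu(1)] + [conv_bn_relu(3, r) for r in rates])
        # Image-level branch: global average pool -> 1x1 conv -> upsample
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # Final 1x1 conv fuses the concatenated branches and reduces channels
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```

For example, applying `ASPP(64).eval()` to a `(1, 64, 17, 17)` feature map yields a fused `(1, 256, 17, 17)` output with the same spatial resolution.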
The structure of an ASPP module can be summarized as follows: the input feature map is processed by parallel branches with different characteristics (1x1 conv, atrous convs with varying rates, image-level pooling) before being concatenated and fused.
DeepLab Architecture Variants
The DeepLab models typically use a powerful CNN classifier (like ResNet or Xception) as a backbone network, but modify it for dense prediction tasks. This often involves:
- Replacing the final fully connected layers with convolutional layers.
- Removing later pooling layers or changing their strides.
- Using dilated convolutions in the later stages of the backbone to maintain a higher spatial resolution output (a larger feature map) without sacrificing the receptive field. For example, the output stride might be reduced from 32 (common in classification) to 16 or 8.
The ASPP module is then applied to the features extracted by this modified backbone.
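The stride-to-dilation swap in the backbone can be illustrated with two standalone convolution layers (a toy sketch, not an actual backbone stage): replacing a stride-2 convolution with a stride-1, dilation-2 convolution preserves the spatial resolution of the output while covering the same receptive field per output position.

```python
import torch
import torch.nn as nn

# Hypothetical "late stage" of a classification backbone: stride 2 halves resolution.
stage_strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

# Dense-prediction variant: stride 1, dilation 2 keeps resolution while the
# dilated kernel spans the same extent the strided kernel effectively covered.
stage_dilated = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=2, dilation=2)

x = torch.randn(1, 64, 32, 32)
print(stage_strided(x).shape)  # halves resolution: (1, 64, 16, 16)
print(stage_dilated(x).shape)  # keeps resolution:  (1, 64, 32, 32)
```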
- DeepLabv3: Improved upon DeepLabv2 by incorporating multi-scale context more robustly, using ASPP with varying dilation rates together with image-level features, and applying batch normalization within the ASPP module.
- DeepLabv3+: Further enhanced the architecture by adding a simple yet effective decoder module. The decoder takes the rich semantic features from the ASPP output and fuses them with low-level features from earlier in the backbone network, which retain finer spatial detail. The ASPP output is upsampled and concatenated with the low-level features (after a 1x1 convolution reduces their channel count), and a few convolutional layers then refine the segmentation map, improving predictions along object boundaries in particular.
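The v3+ decoder described above can be sketched as follows. This is a simplified illustration of the fusion pattern; the channel counts (`48` for the projected low-level features, `256` for refinement) follow common conventions but are assumptions here, not a faithful reproduction of the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    """Sketch of a v3+-style decoder: fuse ASPP output with low-level features."""
    def __init__(self, aspp_ch=256, low_ch=256, low_proj_ch=48, num_classes=21):
        super().__init__()
        # 1x1 conv shrinks low-level channels so they don't dominate the fusion
        self.low_proj = nn.Sequential(
            nn.Conv2d(low_ch, low_proj_ch, 1, bias=False),
            nn.BatchNorm2d(low_proj_ch), nn.ReLU(inplace=True))
        # A few convolutions refine the fused features into class scores
        self.refine = nn.Sequential(
            nn.Conv2d(aspp_ch + low_proj_ch, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, aspp_out, low_level):
        # Upsample semantic features to the (finer) low-level resolution
        aspp_up = F.interpolate(aspp_out, size=low_level.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = torch.cat([aspp_up, self.low_proj(low_level)], dim=1)
        return self.refine(fused)
```

Given ASPP output at stride 16 (e.g. `16x16`) and low-level features at stride 4 (e.g. `64x64`), the decoder produces per-class logits at the low-level resolution.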
Advantages
The DeepLab approach, particularly with ASPP, offers significant advantages for semantic segmentation:
- Multi-Scale Context: Explicitly captures information at multiple scales, making the model robust to variations in object size.
- Controlled Feature Resolution: Uses dilated convolutions to compute dense feature maps at a chosen output stride, avoiding the memory and computation costs of recovering resolution purely through upsampling or deconvolution.
- State-of-the-Art Performance: DeepLab variants have consistently achieved excellent results on standard semantic segmentation benchmarks like PASCAL VOC 2012 and Cityscapes.
By combining a strong backbone network with the multi-scale context aggregation capabilities of ASPP and potentially a refinement decoder, the DeepLab family provides a powerful framework for achieving detailed pixel-level understanding in images. When implementing or using DeepLab models, careful consideration must be given to the choice of backbone, the specific dilation rates used in ASPP, and the structure of the decoder (if used), as these elements impact performance and computational cost.