Standard convolutional networks often rely on pooling layers to gradually reduce spatial resolution while increasing the receptive field. This works well for image classification, where the final output is a single label for the entire image. However, for semantic segmentation, we need dense, pixel-wise predictions. Aggressive downsampling followed by upsampling (as seen in basic Fully Convolutional Networks or Encoder-Decoder structures) can lead to a loss of fine-grained spatial information, making it difficult to accurately delineate object boundaries.
Dilated convolutions, also known as atrous convolutions (from the French "à trous," meaning "with holes"), offer an alternative way to increase the receptive field of convolutional layers without reducing spatial resolution or significantly increasing the number of parameters.
The core idea is to introduce gaps between the kernel weights when applying the convolution. A standard 3×3 convolution kernel operates on a contiguous 3×3 patch of the input feature map. A dilated convolution instead introduces a dilation rate, denoted r, which spaces the kernel weights r−1 pixels apart.
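In frameworks such as PyTorch, this spacing is exposed directly through the `dilation` argument of `nn.Conv2d`. The minimal sketch below (channel and input sizes chosen arbitrarily for illustration) shows that a dilated convolution keeps the same parameter count as its standard counterpart and, with matching padding, the same spatial resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)  # (batch, channels, height, width)

# Standard 3x3 convolution (r=1); padding=1 preserves spatial size.
std = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)

# Dilated 3x3 convolution with r=2: the same 9 weights per filter,
# but an effective 5x5 receptive field; padding=2 preserves spatial size.
dil = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(std(x).shape)  # torch.Size([1, 64, 65, 65])
print(dil(x).shape)  # torch.Size([1, 64, 65, 65]) -- resolution unchanged

# Identical parameter counts: dilation adds no weights.
print(sum(p.numel() for p in std.parameters()),
      sum(p.numel() for p in dil.parameters()))
```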
Consider a 1D example. Let $x$ be the input signal and $w$ a filter of size $K$. The output of a standard 1D convolution is:

$$y[i] = \sum_{k=1}^{K} x[i + k - 1] \cdot w[k]$$

The output of a 1D dilated convolution with dilation rate $r$ is:

$$y[i] = \sum_{k=1}^{K} x[i + r \cdot (k - 1)] \cdot w[k]$$
Notice how the sampled input indices are now spaced r apart, so the kernel's view expands without adding parameters or enlarging the kernel itself (for r=1 the two formulas coincide). In general, a kernel of size K with dilation rate r has an effective extent of r(K−1)+1 positions; a 3×3 kernel with r=2 therefore covers a 5×5 input area.
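To make the indexing concrete, here is a short NumPy sketch of the 1D formula above. The helper `dilated_conv1d` is our own illustrative function; it uses 0-based indexing internally and returns only "valid" output positions:

```python
import numpy as np

def dilated_conv1d(x, w, r=1):
    """Direct implementation of y[i] = sum_{k=1..K} x[i + r*(k-1)] * w[k]."""
    K = len(w)
    span = r * (K - 1) + 1            # effective extent of the dilated filter
    n_out = len(x) - span + 1         # number of valid output positions
    return np.array([
        sum(x[i + r * k] * w[k] for k in range(K))
        for i in range(n_out)
    ])

x = np.arange(10.0)               # [0, 1, ..., 9]
w = np.ones(3)                    # simple 3-tap filter

print(dilated_conv1d(x, w, r=1))  # [ 3.  6.  9. 12. 15. 18. 21. 24.]
print(dilated_conv1d(x, w, r=2))  # [ 6.  9. 12. 15. 18. 21.]
```

With r=2, each output sums inputs two positions apart (e.g., x[0]+x[2]+x[4]=6), using the same three weights as the standard case.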
Figure: Comparison of receptive fields for a 3×3 kernel. Left: standard convolution (r=1) covers a 3×3 input area. Right: dilated convolution with r=2 uses the same 9 weights but covers a 5×5 area by sampling input pixels with gaps between them. Both compute a single output value.
Why is this modification particularly useful for semantic segmentation? Because it decouples receptive-field growth from downsampling. A network built with dilated convolutions can keep its feature maps at high resolution, preserving the spatial detail needed for sharp object boundaries, while still aggregating context from a wide input area, directly addressing the information loss caused by the aggressive downsampling described above.
While powerful, naively stacking dilated convolutions with the same dilation rate can lead to a problem known as the "gridding effect." Because the kernel consistently samples input locations in a fixed sparse pattern, some input pixels might never contribute to the output calculation in deeper layers. This can create checkerboard-like artifacts in the predictions.
A common strategy to mitigate this is to vary the dilation rate across consecutive layers. For example, using rates such as [1, 2, 5] ensures that, over the stack, every input pixel contributes to the output. This approach is sometimes referred to as Hybrid Dilated Convolution (HDC), and the sketch below demonstrates why it works.
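The gridding effect is easy to verify numerically. The `coverage` helper below is hypothetical, written just for this illustration; it traces which input offsets (relative to a single output unit) can influence that unit after a stack of centered 3-tap dilated convolutions:

```python
def coverage(rates, K=3):
    """Input offsets that can influence one output unit after stacking
    centered K-tap convolutions with the given dilation rates."""
    offsets = {0}
    for r in rates:
        taps = [r * (k - K // 2) for k in range(K)]   # e.g. {-r, 0, r} for K=3
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

# A constant rate of 2 leaves permanent holes: odd offsets are never sampled.
print(coverage([2, 2, 2]))  # [-6, -4, -2, 0, 2, 4, 6]

# HDC-style rates cover every offset in [-8, 8] with no gaps.
print(coverage([1, 2, 5]))  # [-8, -7, -6, ..., 6, 7, 8]
```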
Dilated convolutions are a foundational component in several state-of-the-art semantic segmentation models, most notably the DeepLab family. DeepLab architectures leverage dilated convolutions extensively in their backbone networks to extract dense feature maps. They often combine dilated convolutions with different rates in parallel branches using a module called Atrous Spatial Pyramid Pooling (ASPP), which we will discuss in the next section. ASPP explicitly probes the incoming feature map with multiple dilation rates to capture object context at different scales.
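As a schematic preview of that idea (with illustrative rates and layer sizes, not DeepLab's exact configuration), an ASPP-style module simply runs several dilated convolutions in parallel and fuses their outputs:

```python
import torch
import torch.nn as nn

class ParallelDilatedBranches(nn.Module):
    """ASPP-style sketch: parallel 3x3 convolutions with different
    dilation rates, concatenated and fused with a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding=r keeps each branch's output at the input resolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```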
In summary, dilated convolutions provide an effective mechanism to control the receptive field size in CNNs while preserving spatial resolution, addressing a significant challenge in designing networks for dense prediction tasks like semantic segmentation. They allow models to aggregate multi-scale contextual information efficiently, leading to improved segmentation accuracy, especially for larger objects and complex scenes.