Standard Convolutional Neural Networks, particularly those designed for image classification tasks like VGG or ResNet, typically end with one or more fully connected layers. These layers aggregate the spatial information learned by the convolutional filters into a fixed-size vector, suitable for predicting a single class label for the entire input image. However, this aggregation process discards the precise spatial information needed for segmentation, where we require a prediction for every pixel. If you flatten the feature map from the last convolutional layer, you lose the 2D structure.
Fully Convolutional Networks (FCNs), introduced by Long, Shelhamer, and Darrell, provide an elegant solution. The fundamental idea is to replace the fully connected layers of a classification network with equivalent convolutional layers. For instance, the first fully connected layer operating on a 7×7×512 feature map (common before the final classification layers in VGG) can be viewed as a convolution with a 7×7 kernel, producing a 1×1 output map for each filter; the remaining fully connected layers then become 1×1 convolutions. Because every layer is now convolutional, the network preserves spatial structure end to end and can output a heatmap, a dense prediction map aligned with the input image's spatial layout, for inputs of arbitrary size.
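To make the conversion concrete, here is a minimal PyTorch sketch. The layer sizes follow the VGG example above; `num_classes` and the variable names are illustrative assumptions, not the original implementation:

```python
import torch
import torch.nn as nn

# Hypothetical VGG-style activation: a 7x7x512 feature map.
features = torch.randn(1, 512, 7, 7)

# A fully connected layer from 7*7*512 inputs to 4096 outputs is
# equivalent to a convolution with a 7x7 kernel and 4096 filters.
fc_as_conv = nn.Conv2d(512, 4096, kernel_size=7)

# Subsequent fully connected layers (4096 -> 4096, 4096 -> num_classes)
# become 1x1 convolutions, preserving whatever spatial extent remains.
num_classes = 21  # assumed, e.g. PASCAL VOC: 20 classes + background
head = nn.Sequential(
    fc_as_conv,
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, num_classes, kernel_size=1),
)

out = head(features)
print(out.shape)  # torch.Size([1, 21, 1, 1]) for a 7x7 input map;
                  # a larger input would yield a spatial map of scores.
```

Because the head is now convolutional, feeding in a larger image produces a correspondingly larger grid of class scores rather than an error, which is exactly the property FCNs exploit.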
FCNs typically adopt an encoder-decoder structure:

- The encoder (usually a pre-trained classification backbone with its fully connected layers convolutionalized) progressively reduces spatial resolution through convolution and pooling while extracting increasingly abstract, semantic features.
- The decoder upsamples these coarse feature maps, typically with transposed convolutions or interpolation, back toward the input resolution to produce a prediction for every pixel.
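The sketch below illustrates this structure with a deliberately tiny, hypothetical network (the class name, channel counts, and depths are assumptions for illustration, not the FCN paper's architecture):

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """A minimal, illustrative encoder-decoder FCN."""
    def __init__(self, num_classes=21):
        super().__init__()
        # Encoder: convolutions + pooling shrink spatial resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                        # H/2 x W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                        # H/4 x W/4
        )
        # Decoder: transposed convolutions restore the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2),          # H/2 x W/2
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, num_classes, 2, stride=2),  # H x W
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(1, 3, 224, 224)
print(TinyFCN()(x).shape)  # torch.Size([1, 21, 224, 224])
```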
A significant challenge with the basic encoder-decoder structure is that the deep, low-resolution feature maps capture semantic context but lack precise localization information lost during pooling. Upsampling these coarse maps alone often results in blurry or poorly defined segmentation boundaries.
FCNs address this by incorporating skip connections. These connections bridge feature maps from the earlier, higher-resolution layers in the encoder directly to corresponding layers in the decoder. The decoder then combines the coarse, semantic information from the deep layers with the fine-grained, spatial detail from the shallower layers.
For example, the output from the deep encoder layers might be upsampled by a factor of 2. This upsampled map is then element-wise summed with the feature map from an earlier encoder layer that has the same spatial dimensions. This combined map is then further upsampled and potentially combined with features from even earlier layers.
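A shape-level sketch of this fusion, in the style of the FCN-16s/8s variants described later, is shown below. The tensors `pool3`, `pool4`, and `score` are stand-ins for intermediate encoder outputs and the coarse class-score map, assumed to be already projected to `num_classes` channels by 1×1 convolutions; the original paper learns the upsampling with transposed convolutions, whereas bilinear interpolation is used here for brevity:

```python
import torch
import torch.nn.functional as F

num_classes = 21
pool3 = torch.randn(1, num_classes, 28, 28)   # stride-8 features
pool4 = torch.randn(1, num_classes, 14, 14)   # stride-16 features
score = torch.randn(1, num_classes, 7, 7)     # stride-32 predictions

# Upsample the coarse scores by 2 and fuse them with the stride-16
# features via element-wise addition (FCN-16s-style skip).
fused16 = F.interpolate(score, scale_factor=2, mode="bilinear",
                        align_corners=False) + pool4

# Upsample again and fuse with the stride-8 features (FCN-8s-style).
fused8 = F.interpolate(fused16, scale_factor=2, mode="bilinear",
                       align_corners=False) + pool3

# A final 8x upsampling returns to the input resolution (224x224 here).
out = F.interpolate(fused8, scale_factor=8, mode="bilinear",
                    align_corners=False)
print(out.shape)  # torch.Size([1, 21, 224, 224])
```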
Figure: Simplified FCN architecture showing the encoder reducing resolution, the decoder increasing resolution via upsampling, and skip connections combining features from different stages. N represents the number of segmentation classes.
The original FCN paper proposed several variants based on the stride of the final prediction layer and the number of skip connections used:

- FCN-32s: upsamples the final stride-32 prediction map directly by a factor of 32, using no skip connections.
- FCN-16s: upsamples the stride-32 predictions by 2, sums them with predictions derived from the pool4 feature map (stride 16), then upsamples the result by 16.
- FCN-8s: additionally fuses predictions from the pool3 feature map (stride 8) before a final 8× upsampling.
Each successive variant (32s → 16s → 8s) incorporates more fine-grained detail from earlier layers, leading to progressively sharper and more accurate segmentation boundaries.
FCNs are typically trained end-to-end using a pixel-wise loss function. The most common choice is the average cross-entropy loss calculated independently for each pixel. For a single pixel $i$, the loss is $L_i = -\sum_{c=1}^{N_{\text{classes}}} y_{i,c} \log(p_{i,c})$, where $y_{i,c}$ is 1 if the true class of pixel $i$ is $c$ and 0 otherwise, and $p_{i,c}$ is the predicted probability (usually after a softmax activation) that pixel $i$ belongs to class $c$. The total loss for the image is the average of $L_i$ over all pixels.
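In PyTorch, this per-pixel loss is handled directly by `nn.CrossEntropyLoss`, which accepts spatial inputs. A minimal sketch, with randomly generated tensors standing in for network outputs and labels:

```python
import torch
import torch.nn as nn

num_classes = 21
# logits: per-pixel class scores from the network, shape (N, C, H, W).
logits = torch.randn(2, num_classes, 64, 64)
# targets: ground-truth class index for each pixel, shape (N, H, W).
targets = torch.randint(0, num_classes, (2, 64, 64))

# CrossEntropyLoss applies log-softmax over the class dimension and
# averages the per-pixel negative log-likelihoods, matching the
# formula above.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)
print(loss.item())
```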
FCNs represented a major step forward for semantic segmentation with deep learning. They demonstrated that convolutional networks could be trained end-to-end for dense prediction tasks, leveraging pre-trained classification networks effectively. They also introduced the powerful concepts of converting fully connected layers to convolutions and using skip connections to fuse multi-level features.
However, FCNs still have limitations:

- Even with skip connections, predicted segmentation boundaries can remain coarse, since much fine detail is lost in the heavily downsampled encoder features.
- The repeated pooling that builds semantic context also sacrifices spatial resolution, forcing a trade-off between receptive field size and localization accuracy.
- FCNs perform semantic segmentation only: they assign a class to every pixel but cannot distinguish separate instances of the same class.
These limitations motivated the development of subsequent architectures like U-Net (which refines the symmetric encoder-decoder structure with more extensive skip connections) and DeepLab (which introduces dilated convolutions to manage spatial resolution differently), as well as instance segmentation methods like Mask R-CNN, which we will explore later in this chapter. Nonetheless, understanding FCNs is foundational for grasping the core principles behind modern image segmentation networks.