As networks grew deeper following architectures like VGG, researchers encountered a significant challenge: accuracy would saturate and then rapidly degrade. This wasn't necessarily caused by overfitting, as the training error itself started increasing with more layers. This phenomenon, known as the degradation problem, indicated that standard deep networks were becoming increasingly difficult to optimize. Simply stacking layers made it harder for the solver to find good solutions, partly due to issues like vanishing or exploding gradients during backpropagation.
Instead of hoping network layers could directly learn a desired underlying mapping, say H(x), the innovation behind residual networks (ResNets) was to reframe the problem. The core idea is to let the stacked layers learn a residual function F(x) relative to the layer's input x. The original desired mapping H(x) is then obtained by adding the input back:
H(x)=F(x)+x
This reformulation is based on the hypothesis that it's easier to optimize the residual mapping F(x) than the original, unreferenced mapping H(x). In the extreme case, if an identity mapping were optimal (H(x)=x), it would be much easier for the stacked layers to learn to push the weights of F(x) towards zero, rather than trying to learn the identity function from scratch through multiple non-linear layers.
This concept is implemented using residual blocks. A typical residual block consists of a few stacked layers (e.g., two or three convolutional layers, often with Batch Normalization and ReLU activations) and a shortcut or skip connection that bypasses these layers and performs an element-wise addition with the output of the stacked layers.
The equation for the output y of a residual block is often written as:
y = F(x, {W_i}) + x
Here, x and y are the input and output of the block, and F(x, {W_i}) is the residual mapping learned by the stacked layers, parameterized by their weights {W_i}.
The shortcut connection typically performs an identity mapping, meaning it directly passes the input x to the addition operation. This identity mapping is important: it introduces no extra parameters and adds no computational complexity.
A visualization of a common residual block. The input x flows through the main path (F(x) involving convolutions, batch normalization, and activations) and simultaneously bypasses these layers via the identity shortcut. The outputs are combined using element-wise addition, followed by a final activation (like ReLU).
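As a concrete illustration, here is a minimal PyTorch sketch of such a block. The class name and the choice of two 3x3 convolutions are illustrative assumptions, not the exact layers of any published ResNet variant:

```python
import torch
import torch.nn as nn


class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers with an identity shortcut: y = F(x) + x."""

    def __init__(self, channels: int):
        super().__init__()
        # Main path F(x): conv -> BN -> ReLU -> conv -> BN
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                              # the shortcut carries x unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # element-wise addition: F(x) + x
        return self.relu(out)                     # final ReLU applied after the addition


# The identity shortcut adds no parameters, and the block preserves the input shape.
x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)            # torch.Size([1, 64, 56, 56])
```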
What happens if the dimensions of the input x and the output of the residual function F(x) don't match? This often occurs when a convolutional layer in F(x) uses a stride greater than 1 or changes the number of filters. In such cases, the identity shortcut x cannot be directly added.
Two common strategies are employed:
Zero-Padding: Pad the input x with extra zeros to increase its dimensions to match the output of F(x).
Projection Shortcut: Use a projection, typically a 1x1 convolution, in the shortcut connection to explicitly match the dimensions (both spatial and depth/channel dimensions). The equation becomes:
y = F(x, {W_i}) + W_s x
Here, W_s is a linear projection, typically implemented as a 1x1 convolution. While it adds parameters and computation, this approach can offer more representational power. The original ResNet paper explored both options and found projection shortcuts to perform slightly better, but the identity shortcut is more computationally efficient and usually sufficient.
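Below is a hedged PyTorch sketch of a block that uses a projection shortcut. The class name, stride, and channel counts are assumptions chosen for illustration; the 1x1 convolution plays the role of W_s, matching both the spatial size and the channel count before the addition:

```python
import torch
import torch.nn as nn


class DownsampleResidualBlock(nn.Module):
    """Residual block whose main path halves the spatial size and changes
    the channel count, so the shortcut needs a 1x1 projection (W_s)."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        # Main path F(x): the first conv applies the stride and changes channels.
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut: a 1x1 conv with the same stride matches both
        # the spatial and channel dimensions of F(x).
        self.projection = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.projection(x)            # y = F(x, {W_i}) + W_s x
        return self.relu(out)


x = torch.randn(1, 64, 56, 56)
print(DownsampleResidualBlock(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 28, 28])
```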
Residual connections address the degradation problem and make deeper networks trainable in two main ways: the identity shortcut gives gradients a direct, unobstructed path back to earlier layers during backpropagation, and the residual formulation makes near-identity mappings easy to represent, since the stacked layers only need to drive F(x) toward zero.
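The gradient-flow argument can be made precise following the analysis of He et al. (2016). Assuming identity shortcuts and ignoring the post-addition activation for simplicity, write the output of block l as x_{l+1} = x_l + F(x_l, W_l):

```latex
% Unrolling x_{l+1} = x_l + F(x_l, W_l) from block l up to a deeper block L:
\[
    x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)
\]
% Backpropagating a loss \mathcal{L} through this sum:
\[
    \frac{\partial \mathcal{L}}{\partial x_l}
      = \frac{\partial \mathcal{L}}{\partial x_L}
        \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
\]
```

The standalone 1 inside the parentheses means part of the gradient reaches block l directly through the shortcuts, so it cannot vanish no matter how small the derivatives of the residual branches become.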
The original ResNet architecture applied the final activation (ReLU) after the element-wise addition. Subsequent research introduced variations like the "pre-activation" ResNet block (He et al., 2016). In this variant, Batch Normalization and ReLU activation are applied before the convolutional layers within the residual path F(x). This design can lead to improved regularization and performance by providing a cleaner information path through the network and making optimization potentially easier.
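A sketch of the pre-activation ordering, in the same illustrative PyTorch style as above (the class name is an assumption):

```python
import torch
import torch.nn as nn


class PreActResidualBlock(nn.Module):
    """Pre-activation variant: BN and ReLU come *before* each convolution,
    and no activation is applied after the addition."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))    # BN -> ReLU -> conv
        out = self.conv2(self.relu(self.bn2(out)))
        return out + x    # the shortcut stays a clean identity path; no final ReLU
```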
While ResNet popularized the term "residual connection" specifically for the y=F(x)+x formulation, the general idea of skip connections – connections that bypass one or more layers – appears in other successful architectures.
For instance, U-Net, commonly used for image segmentation (covered in Chapter 4), employs long skip connections that concatenate feature maps from the contracting path (encoder) to corresponding layers in the expanding path (decoder). These connections help the decoder recover fine-grained spatial information lost during pooling operations in the encoder. DenseNets (discussed next in this chapter) use a different, more extensive form of feature concatenation across layers.
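To make the difference concrete, the following sketch contrasts the two ways of combining features. The tensor shapes and the fuse convolution are illustrative assumptions, not U-Net's exact configuration:

```python
import torch
import torch.nn as nn

# Additive shortcut (ResNet style): shapes must match, and the channel count
# is unchanged by the addition. Pretend the second tensor is F(x).
features = torch.randn(1, 64, 56, 56)
residual_out = features + torch.randn(1, 64, 56, 56)     # still 64 channels

# Concatenation skip (U-Net style): encoder features are stacked onto the
# upsampled decoder features along the channel dimension.
encoder_features = torch.randn(1, 64, 56, 56)   # saved from the contracting path
decoder_features = torch.randn(1, 64, 56, 56)   # upsampled in the expanding path
combined = torch.cat([encoder_features, decoder_features], dim=1)  # 128 channels

# The decoder then reduces the channel count again with a convolution.
fuse = nn.Conv2d(128, 64, kernel_size=3, padding=1)
print(fuse(combined).shape)  # torch.Size([1, 64, 56, 56])
```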
These examples highlight that strategically connecting layers across different depths is a powerful architectural pattern for improving information flow and enabling the training of deep, high-performing networks for various computer vision tasks. Understanding residual connections provides a foundation for appreciating these more complex connectivity patterns.