As we assemble the Transformer's encoder and decoder layers, incorporating multi-head attention and feed-forward networks, we encounter a fundamental challenge common to deep neural networks: training effectiveness. Stacking many layers, while increasing model capacity, can make optimization difficult. Gradients, the signals used for learning, can weaken or vanish as they propagate backward through numerous layers, hindering the updates of earlier layers. Similarly, gradients can sometimes become excessively large, leading to instability.
To combat these issues and enable the training of the deep architectures characteristic of Transformers, residual connections, also known as skip connections, are employed. This technique, popularized by Residual Networks (ResNets) in computer vision, provides a direct path for gradients to flow through the network.
In the context of a Transformer layer, each main sub-layer (the multi-head self-attention mechanism and the position-wise feed-forward network) is wrapped within a residual connection. The core idea is remarkably simple: the input to the sub-layer is added to the output of that sub-layer.
Let x represent the input to a sub-layer (e.g., the output from the previous layer or the initial embeddings). Let SubLayer(x) denote the function implemented by the sub-layer itself (e.g., multi-head attention or the FFN). The output of the operation, before normalization, is calculated as:
ResidualOutput = x + SubLayer(x)

This addition is typically followed immediately by Layer Normalization, forming the "Add & Norm" blocks seen frequently in Transformer diagrams.
A residual connection bypasses a sub-layer and adds the original input x to the sub-layer's output SubLayer(x).
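The following is a minimal sketch of this "Add & Norm" pattern, assuming PyTorch; the AddNorm class name and the stand-in sub-layer are illustrative choices, not a prescribed API.

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Illustrative wrapper computing LayerNorm(x + SubLayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer_output: torch.Tensor) -> torch.Tensor:
        # Residual connection: add the sub-layer's output back to its input x,
        # then apply layer normalization to the sum ("Add & Norm").
        return self.norm(x + sublayer_output)

# Example with a stand-in sub-layer (a single linear map):
d_model = 512
x = torch.randn(2, 10, d_model)        # (batch, sequence length, model width)
sublayer = nn.Linear(d_model, d_model)
out = AddNorm(d_model)(x, sublayer(x))
print(out.shape)                        # torch.Size([2, 10, 512])
```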
Improved Gradient Flow: During backpropagation, the addition operation gives gradients a direct path back through the block. Because the output is x + SubLayer(x), the gradient of the output with respect to the input x includes a +1 term from this identity path. Even if the gradient through the SubLayer(x) path becomes very small, a gradient signal still flows back relatively unimpeded through the identity connection, which significantly alleviates the vanishing gradient problem and allows learning signals to reach earlier layers more effectively (a small demonstration follows this list).
Identity Mapping Encouragement: The residual connection makes it easier for a layer to learn an identity function. If the optimal transformation for a given layer is simply to pass the input through unchanged, the network can achieve this by driving the weights within the SubLayer towards zero. Without the residual connection, learning an exact identity mapping using complex non-linear transformations can be difficult. This property allows layers to be added without necessarily harming performance; the network can effectively "ignore" a layer if needed.
Enabling Deeper Networks: By mitigating gradient issues and simplifying the learning of identity mappings, residual connections are fundamental to successfully training the very deep networks (often 6, 12, 24, or more layers in both the encoder and decoder stacks) that characterize powerful Transformer models.
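A quick way to see the gradient argument is a toy autograd check, sketched below assuming PyTorch. The "sub-layer" here is just multiplication by near-zero weights, so its own gradient path has almost vanished; the residual path still delivers a gradient of roughly 1.

```python
import torch

# Toy check of the "+1 from the identity path" argument.
torch.manual_seed(0)
x = torch.randn(4, requires_grad=True)
w = torch.full((4,), 1e-6)            # stands in for a sub-layer whose gradient has almost vanished
sublayer_out = w * x                  # SubLayer(x)

# Without the residual connection: d(sum)/dx is just w, about 1e-6.
sublayer_out.sum().backward(retain_graph=True)
print(x.grad)                         # values around 1e-6

# With the residual connection x + SubLayer(x): the identity path adds +1.
x.grad = None
(x + sublayer_out).sum().backward()
print(x.grad)                         # values around 1.000001
```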
Within both the encoder and decoder layers of a standard Transformer, you'll find two primary sub-layers, each followed by this "Add & Norm" step:
Multi-Head Attention: The input x is passed into the multi-head attention mechanism. Its output, MultiHeadAttention(x), is then added to the original input x, and layer normalization is applied to this sum:

Output_attn = LayerNorm(x + MultiHeadAttention(x))
Feed-Forward Network: The output of the first normalization step, y = Output_attn, serves as the input to the position-wise feed-forward network. The output of the FFN, FFN(y), is added to its input y, and a final layer normalization is applied:

Output_layer = LayerNorm(y + FFN(y))
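Putting the two "Add & Norm" blocks together, a single encoder layer can be sketched as follows, assuming PyTorch. The EncoderLayer name, the default dimensions, and the omission of masks and dropout are simplifications for illustration, not the definitive implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer with its two post-norm 'Add & Norm' blocks."""
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: multi-head self-attention, wrapped in Add & Norm.
        attn_out, _ = self.attn(x, x, x)     # queries = keys = values = x
        y = self.norm1(x + attn_out)         # Output_attn = LayerNorm(x + MultiHeadAttention(x))

        # Sub-layer 2: position-wise feed-forward network, wrapped in Add & Norm.
        return self.norm2(y + self.ffn(y))   # Output_layer = LayerNorm(y + FFN(y))

layer = EncoderLayer()
out = layer(torch.randn(2, 10, 512))         # (batch, sequence length, model width)
print(out.shape)                             # torch.Size([2, 10, 512])
```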
These residual connections are a deceptively simple yet profoundly effective technique, forming an integral part of the Transformer's design and contributing significantly to its success in modeling complex sequential data. Without them, training the deep stacks of layers typical of modern large language models would be substantially more challenging.