As we assemble the Transformer, stacking layer upon layer allows the model to learn increasingly complex representations of the input sequence. However, training very deep neural networks presents challenges, primarily the risk of vanishing or exploding gradients, which can hinder learning. Furthermore, the distribution of activations within layers can shift during training, potentially slowing down convergence. The Transformer architecture addresses these issues by wrapping each sub-layer (both the self-attention and the feed-forward networks) with two simple yet highly effective techniques: Residual Connections (the "Add" part) and Layer Normalization (the "Norm" part).
Imagine information flowing through a deep network. As it passes through successive transformations in each layer, it can become progressively altered, potentially losing details from the original input. Similarly, during backpropagation, gradients need to flow backward through all these layers. In very deep networks, these gradients can become extremely small (vanish) or extremely large (explode), making it difficult for the weights in the earlier layers to update effectively.
Residual connections, introduced in ResNet models, provide a direct path for information and gradients to bypass a transformation block. In the Transformer, the input to a sub-layer is added directly to the output of that sub-layer.
If $x$ is the input to a sub-layer (like Multi-Head Attention or the Feed-Forward Network), and $\text{Sublayer}(x)$ is the function computed by that sub-layer, the output of the residual connection is:

$$\text{Output} = x + \text{Sublayer}(x)$$

This addition operation creates a "shortcut" or "skip connection".
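As a concrete illustration, here is a minimal sketch in PyTorch of the "Add" step wrapping an arbitrary sub-layer. The `residual` helper and the tensor shapes are illustrative assumptions, not code from the original Transformer implementation.

```python
import torch

def residual(x: torch.Tensor, sublayer) -> torch.Tensor:
    """Add the sub-layer's input back to its output (the 'Add' step).

    x        : tensor of shape (batch, seq_len, d_model)
    sublayer : any callable mapping x to a tensor of the same shape,
               e.g. a multi-head attention or feed-forward module.
    """
    return x + sublayer(x)

# Example: a shortcut around a simple feed-forward transformation.
d_model = 8
x = torch.randn(2, 5, d_model)                      # (batch, seq_len, d_model)
ffn = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.ReLU(),
    torch.nn.Linear(4 * d_model, d_model),
)
out = residual(x, ffn)
print(out.shape)                                    # torch.Size([2, 5, 8])
```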
After the residual connection adds the original input x to the sub-layer's output Sublayer(x), the result passes through a Layer Normalization step. Normalization techniques help stabilize the training dynamics by standardizing the inputs to subsequent layers.
While Batch Normalization is common in computer vision, Layer Normalization is often preferred for Transformers and other sequence models. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization normalizes the inputs across the feature dimension independently for each sequence element (token) in the batch. This means its computation doesn't depend on other examples in the batch, making it suitable for variable sequence lengths and scenarios where batch sizes might be small.
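The following short sketch (assuming PyTorch and illustrative tensor shapes) highlights the difference in normalization axes: `nn.LayerNorm(d_model)` computes its statistics over the last (feature) dimension of each token, so the result for one sequence does not depend on any other example in the batch.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 4, 10, 16
x = torch.randn(batch, seq_len, d_model)

# LayerNorm: statistics over the d_model features of each individual token.
layer_norm = nn.LayerNorm(d_model)
y = layer_norm(x)

# Each (batch, position) slice is normalized independently:
# per-token mean is ~0 and variance is ~1 over the feature dimension.
print(y.mean(dim=-1).abs().max())             # close to 0
print(y.var(dim=-1, unbiased=False).mean())   # close to 1

# By contrast, BatchNorm1d normalizes each feature across the batch
# (and sequence) dimension, mixing information between examples.
batch_norm = nn.BatchNorm1d(d_model)
z = batch_norm(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (N, C, L)
```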
For a given output vector $y = x + \text{Sublayer}(x)$ corresponding to a single position in the sequence (with dimension $d_{\text{model}}$), Layer Normalization first calculates the mean ($\mu$) and variance ($\sigma^2$) across the $d_{\text{model}}$ features:
$$\mu = \frac{1}{d_{\text{model}}} \sum_{j=1}^{d_{\text{model}}} y_j \qquad \sigma^2 = \frac{1}{d_{\text{model}}} \sum_{j=1}^{d_{\text{model}}} (y_j - \mu)^2$$

Then it normalizes the vector $y$ using this mean and variance, and applies a learned scaling factor $\gamma$ (gamma) and a learned shifting factor $\beta$ (beta). These learnable parameters allow the network to scale and shift the normalized output, potentially restoring representational capacity if the raw normalization ($\mu = 0$, $\sigma = 1$) is too restrictive. The final output is:
$$\text{LayerNorm}(y) = \gamma \frac{y - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, $\epsilon$ (epsilon) is a small constant added to the variance for numerical stability, preventing division by zero. Both $\gamma$ and $\beta$ are learnable parameter vectors of dimension $d_{\text{model}}$, typically initialized to ones and zeros, respectively.
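A minimal from-scratch sketch of this computation, mirroring the formula above. The module name and default $\epsilon$ are illustrative choices, not the exact code of any particular library.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer normalization over the last (feature) dimension."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learned scale, initialized to ones
        self.beta = nn.Parameter(torch.zeros(d_model))   # learned shift, initialized to zeros
        self.eps = eps                                   # small constant for numerical stability

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # Mean and variance are computed per position, across the d_model features.
        mu = y.mean(dim=-1, keepdim=True)
        var = y.var(dim=-1, unbiased=False, keepdim=True)
        return self.gamma * (y - mu) / torch.sqrt(var + self.eps) + self.beta
```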
In the Transformer, each sub-layer (Multi-Head Attention and Position-wise Feed-Forward Network) in both the encoder and decoder is followed by this combined "Add & Norm" operation.
The flow within an Add & Norm block. The input x passes through the sub-layer. The output of the sub-layer is then added to the original input x via the residual connection. Finally, this sum is processed by Layer Normalization to produce the block's output.
This pattern, $\text{LayerNorm}(x + \text{Sublayer}(x))$, is repeated consistently throughout the Transformer's encoder and decoder stacks. These seemingly simple additions and normalizations are fundamental ingredients that enable the training of the deep architectures characteristic of modern Transformer models, contributing significantly to their success on complex sequence modeling tasks.
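Putting the two pieces together, the combined "Add & Norm" block described here could be sketched as a small wrapper module that is reused around both sub-layers. The `AddAndNorm` name, the layer sizes, and the use of `nn.MultiheadAttention` are illustrative assumptions for this sketch.

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Post-norm residual block: LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, eps=eps)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # 'Add': residual shortcut around the sub-layer.
        # 'Norm': layer normalization of the summed result.
        return self.norm(x + sublayer(x))

# Usage: wrap both the self-attention and the feed-forward sub-layers.
d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

add_norm_1 = AddAndNorm(d_model)
add_norm_2 = AddAndNorm(d_model)

x = torch.randn(2, 16, d_model)                      # (batch, seq_len, d_model)
x = add_norm_1(x, lambda t: attn(t, t, t, need_weights=False)[0])
x = add_norm_2(x, ffn)
print(x.shape)                                       # torch.Size([2, 16, 512])
```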