As we assemble the Transformer, stacking layer upon layer allows the model to learn increasingly complex representations of the input sequence. However, training very deep neural networks presents challenges, primarily the risk of vanishing or exploding gradients, which can hinder learning. Furthermore, the distribution of activations within layers can shift during training, potentially slowing down convergence. The Transformer architecture addresses these issues by wrapping each sub-layer (both the self-attention and the feed-forward networks) with two simple yet highly effective techniques: Residual Connections (the "Add" part) and Layer Normalization (the "Norm" part).
Imagine information flowing through a deep network. As it passes through successive transformations in each layer, it can become progressively altered, potentially losing details from the original input. Similarly, during backpropagation, gradients need to flow backward through all these layers. In very deep networks, these gradients can become extremely small (vanish) or extremely large (explode), making it difficult for the weights in the earlier layers to update effectively.
Residual connections, introduced in ResNet models, provide a direct path for information and gradients to bypass a transformation block. In the Transformer, the input to a sub-layer is added directly to the output of that sub-layer.
If $x$ is the input to a sub-layer (like Multi-Head Attention or the Feed-Forward Network), and $\text{Sublayer}(x)$ is the function computed by that sub-layer, the output of the residual connection is:

$$\text{Output} = x + \text{Sublayer}(x)$$

This addition operation creates a "shortcut" or "skip connection".
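As a concrete illustration, here is a minimal sketch in PyTorch of the "Add" step wrapping an arbitrary sub-layer. The `residual` helper and the tensor shapes are illustrative assumptions, not code from the original Transformer implementation.

```python
import torch

def residual(x: torch.Tensor, sublayer) -> torch.Tensor:
    """Add the sub-layer's input back to its output (the 'Add' step).

    x        : tensor of shape (batch, seq_len, d_model)
    sublayer : any callable mapping x to a tensor of the same shape,
               e.g. a multi-head attention or feed-forward module.
    """
    return x + sublayer(x)

# Example: a shortcut around a simple feed-forward transformation.
d_model = 8
x = torch.randn(2, 5, d_model)                      # (batch, seq_len, d_model)
ffn = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.ReLU(),
    torch.nn.Linear(4 * d_model, d_model),
)
out = residual(x, ffn)
print(out.shape)                                    # torch.Size([2, 5, 8])
```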
After the residual connection adds the original input x to the sub-layer's output Sublayer(x), the result passes through a Layer Normalization step. Normalization techniques help stabilize the training dynamics by standardizing the inputs to subsequent layers.
While Batch Normalization is common in computer vision, Layer Normalization is often preferred for Transformers and other sequence models. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization normalizes the inputs across the feature dimension independently for each sequence element (token) in the batch. This means its computation doesn't depend on other examples in the batch, making it suitable for variable sequence lengths and scenarios where batch sizes might be small.
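The following short sketch (assuming PyTorch and illustrative tensor shapes) highlights the difference in normalization axes: `nn.LayerNorm(d_model)` computes its statistics over the last (feature) dimension of each token, so the result for one sequence does not depend on any other example in the batch.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 4, 10, 16
x = torch.randn(batch, seq_len, d_model)

# LayerNorm: statistics over the d_model features of each individual token.
layer_norm = nn.LayerNorm(d_model)
y = layer_norm(x)

# Each (batch, position) slice is normalized independently:
# per-token mean is ~0 and variance is ~1 over the feature dimension.
print(y.mean(dim=-1).abs().max())             # close to 0
print(y.var(dim=-1, unbiased=False).mean())   # close to 1

# By contrast, BatchNorm1d normalizes each feature across the batch
# (and sequence) dimension, mixing information between examples.
batch_norm = nn.BatchNorm1d(d_model)
z = batch_norm(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (N, C, L)
```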
For a given output vector $y = x + \text{Sublayer}(x)$ corresponding to a single position in the sequence (with dimension $d_{\text{model}}$), Layer Normalization first calculates the mean ($\mu$) and variance ($\sigma^2$) across the $d_{\text{model}}$ features:
$$\mu = \frac{1}{d_{\text{model}}} \sum_{j=1}^{d_{\text{model}}} y_j \qquad \sigma^2 = \frac{1}{d_{\text{model}}} \sum_{j=1}^{d_{\text{model}}} (y_j - \mu)^2$$

Then it normalizes the vector $y$ using this mean and variance, and applies a learned scaling factor $\gamma$ (gamma) and a learned shifting factor $\beta$ (beta). These learnable parameters allow the network to scale and shift the normalized output, potentially restoring representational capacity if the raw normalization ($\mu = 0$, $\sigma = 1$) is too restrictive. The final output is:
$$\text{LayerNorm}(y) = \gamma \frac{y - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, $\epsilon$ (epsilon) is a small constant added to the variance for numerical stability, preventing division by zero. Both $\gamma$ and $\beta$ are learnable parameter vectors of dimension $d_{\text{model}}$, typically initialized to ones and zeros, respectively.
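A minimal from-scratch sketch of this computation, mirroring the formula above. The module name and default $\epsilon$ are illustrative choices, not the exact code of any particular library.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer normalization over the last (feature) dimension."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learned scale, initialized to ones
        self.beta = nn.Parameter(torch.zeros(d_model))   # learned shift, initialized to zeros
        self.eps = eps                                   # small constant for numerical stability

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # Mean and variance are computed per position, across the d_model features.
        mu = y.mean(dim=-1, keepdim=True)
        var = y.var(dim=-1, unbiased=False, keepdim=True)
        return self.gamma * (y - mu) / torch.sqrt(var + self.eps) + self.beta
```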
In the Transformer, each sub-layer (Multi-Head Attention and Position-wise Feed-Forward Network) in both the encoder and decoder is followed by this combined "Add & Norm" operation.
The flow within an Add & Norm block. The input x passes through the sub-layer. The output of the sub-layer is then added to the original input x via the residual connection. Finally, this sum is processed by Layer Normalization to produce the block's output.
This pattern, $\text{LayerNorm}(x + \text{Sublayer}(x))$, is repeated consistently throughout the Transformer's encoder and decoder stacks. These seemingly simple additions and normalizations are fundamental ingredients that enable the training of the deep architectures characteristic of modern Transformer models, contributing significantly to their success on complex sequence modeling tasks.
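Putting the two pieces together, the combined "Add & Norm" block described here could be sketched as a small wrapper module that is reused around both sub-layers. The `AddAndNorm` name, the layer sizes, and the use of `nn.MultiheadAttention` are illustrative assumptions for this sketch.

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Post-norm residual block: LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, eps=eps)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # 'Add': residual shortcut around the sub-layer.
        # 'Norm': layer normalization of the summed result.
        return self.norm(x + sublayer(x))

# Usage: wrap both the self-attention and the feed-forward sub-layers.
d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

add_norm_1 = AddAndNorm(d_model)
add_norm_2 = AddAndNorm(d_model)

x = torch.randn(2, 16, d_model)                      # (batch, seq_len, d_model)
x = add_norm_1(x, lambda t: attn(t, t, t, need_weights=False)[0])
x = add_norm_2(x, ffn)
print(x.shape)                                       # torch.Size([2, 16, 512])
```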