As we construct deeper neural networks, stacking multiple layers like the self-attention and feed-forward sublayers in the Transformer, we encounter significant challenges during training. Primarily, gradients can vanish (become extremely small) or explode (become extremely large) as they propagate backward through many layers. This makes it difficult for the model to learn effectively. Furthermore, the distribution of activations in intermediate layers can change during training (a phenomenon sometimes related to internal covariate shift), complicating the learning process. Two simple yet highly effective techniques, residual connections and layer normalization, are employed within the Transformer architecture to address these issues and enable the training of very deep models.
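To make the vanishing-gradient problem concrete, here is a small toy sketch (the depth, width, and Linear+Tanh sublayers are arbitrary choices for illustration, not part of the Transformer): it passes an input through many stacked layers and inspects how much gradient signal survives the trip back to the input.

import torch
import torch.nn as nn

# Toy illustration of the problem described above (sizes are arbitrary):
# push an input through many stacked Linear+Tanh layers and check how
# large the gradient is by the time it reaches the input.
torch.manual_seed(0)
depth, width = 50, 64
stack = nn.Sequential(*[
    nn.Sequential(nn.Linear(width, width), nn.Tanh()) for _ in range(depth)
])

x = torch.randn(8, width, requires_grad=True)
stack(x).sum().backward()

# With typical initializations this norm shrinks sharply as 'depth' grows,
# which is the vanishing-gradient problem in miniature.
print("Gradient norm at the input of a", depth, "layer stack:",
      x.grad.norm().item())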
Residual connections, also known as skip connections, provide an alternative path for the gradient to flow through the network. Instead of simply passing the output of a sublayer to the next, we add the input of the sublayer to its output.
If a sublayer is represented by a function SubLayer(⋅), and its input is x, the output of the block with a residual connection is:
Output = x + SubLayer(x)

This structure allows the network to easily learn an identity function if a particular sublayer is not beneficial; the sublayer's output can simply be driven towards zero. More importantly, during backpropagation, the gradient can flow directly through the addition operation from the output back to the input x. This bypasses the transformations within the sublayer, providing a "shortcut" that helps prevent the gradient signal from diminishing excessively as it travels through many layers.
A residual connection adds the input x to the output of the SubLayer.
In PyTorch, this is straightforward to implement. Assuming sublayer is a module (like multi-head attention or a feed-forward network) and x is the input tensor:
import torch
import torch.nn as nn

# Assume 'sublayer' is defined elsewhere (e.g., MultiHeadAttention, FeedForward)
# class SubLayer(nn.Module):
#     def __init__(self, d_model, ...):
#         super().__init__()
#         # ... define layers ...
#     def forward(self, x):
#         # ... compute sublayer output ...
#         return processed_x

class ResidualConnection(nn.Module):
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        """
        Apply residual connection to any sublayer.
        """
        # Add the original input 'x' to the output of the sublayer
        return x + self.sublayer(x)
# Example Usage
# d_model = 512
# input_tensor = torch.randn(batch_size, seq_len, d_model)
# attention_layer = MultiHeadAttention(...) # Assume defined
# residual_block = ResidualConnection(attention_layer)
# output_tensor = residual_block(input_tensor)
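A quick sketch of the gradient shortcut, reusing the ResidualConnection class defined above with a small stand-in sublayer (the stand-in is an arbitrary choice for illustration, not one of the Transformer sublayers):

# Compare the gradient reaching x with and without the residual addition.
stand_in = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
residual_block = ResidualConnection(stand_in)

x = torch.randn(2, 8, requires_grad=True)
residual_block(x).sum().backward()
grad_with_skip = x.grad.clone()

x.grad = None
stand_in(x).sum().backward()  # same sublayer, but no residual addition
grad_without_skip = x.grad.clone()

# For this sum loss, the skip path adds an identity term: the gradient
# reaching x equals the sublayer's gradient plus a vector of ones.
print("Gradient norm with skip:   ", grad_with_skip.norm().item())
print("Gradient norm without skip:", grad_without_skip.norm().item())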
Normalization techniques help stabilize the training process by controlling the distribution of activations. Batch Normalization is common in computer vision, but it computes its normalization statistics (mean and variance) across the batch dimension. This is problematic for sequence models, where sequence lengths may vary within a batch, and it couples batch elements together in ways that aren't always desirable.
Layer Normalization (LayerNorm) offers an alternative. It normalizes the inputs across the features for each data point (e.g., each token in a sequence) independently. It calculates the mean and variance used for normalization from all the summed inputs to the neurons within a single layer on a single training example.
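The difference in where the statistics come from can be seen in a short sketch (toy sizes, both modules left at their default training-mode settings): changing one sequence in the batch alters BatchNorm's output for every other sequence, while LayerNorm's per-token outputs are unaffected.

import torch
import torch.nn as nn

batch_size, seq_len, d_model = 4, 10, 16
x = torch.randn(batch_size, seq_len, d_model)

# BatchNorm1d expects (batch, channels, length) and pools statistics over
# the batch and sequence dimensions for each feature channel.
batch_norm = nn.BatchNorm1d(d_model)
# LayerNorm pools statistics over the d_model features of each token only.
layer_norm = nn.LayerNorm(d_model)

bn_out = batch_norm(x.transpose(1, 2)).transpose(1, 2)
ln_out = layer_norm(x)

# Perturb only the last sequence in the batch.
x_perturbed = x.clone()
x_perturbed[-1] = torch.randn(seq_len, d_model)
bn_out2 = batch_norm(x_perturbed.transpose(1, 2)).transpose(1, 2)
ln_out2 = layer_norm(x_perturbed)

print("BatchNorm output for sequence 0 unchanged:",
      torch.allclose(bn_out[0], bn_out2[0]))   # False: depends on the batch
print("LayerNorm output for sequence 0 unchanged:",
      torch.allclose(ln_out[0], ln_out2[0]))   # True: per-token statistics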
Given an input vector x (representing the activations for a single token position across all its d_model features), LayerNorm computes the normalized output h as follows:

1. Calculate the mean (μ) and variance (σ²) across the feature dimension d_model:

$$\mu = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} x_i \qquad \sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2$$

2. Normalize the input x:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where ϵ is a small constant added for numerical stability.

3. Scale and shift the normalized output using learnable parameters γ (gamma, scale) and β (beta, shift), which have the same dimension as x:

$$h_i = \gamma_i \hat{x}_i + \beta_i$$

These learnable parameters γ and β allow the network to adaptively scale and shift the normalized activations, potentially even recovering the original activations if that proves optimal for the network. LayerNorm helps stabilize hidden state dynamics, reduces sensitivity to initialization scales, and can even provide a slight regularization effect.
In PyTorch, torch.nn.LayerNorm implements this:
import torch
import torch.nn as nn
# Example parameters
batch_size = 4
seq_len = 10
d_model = 512
epsilon = 1e-5 # Small value for numerical stability
# Input tensor (batch, sequence length, features)
input_tensor = torch.randn(batch_size, seq_len, d_model)
# Initialize Layer Normalization
# Normalizes over the last dimension (d_model) by default
layer_norm = nn.LayerNorm(d_model, eps=epsilon)
# Apply Layer Normalization
normalized_output = layer_norm(input_tensor)
# Check shapes
print("Input shape:", input_tensor.shape)
print("Output shape:", normalized_output.shape)
# Verify mean and std dev for one token position.
# At initialization gamma=1 and beta=0, so the output mean/std are
# approximately 0/1 (not exactly, due to epsilon and because .std()
# applies Bessel's correction). After training, the learned gamma/beta
# can shift these statistics again.
print(
"\nMean of normalized output (example 0, token 0):",
normalized_output[0, 0, :].mean().item()
)
print(
"Std dev of normalized output (example 0, token 0):",
normalized_output[0, 0, :].std().item()
)
# nn.LayerNorm has learnable parameters gamma (weight) and beta (bias)
print("\nLayerNorm learnable gamma (weight):", layer_norm.weight.shape)
print("LayerNorm learnable beta (bias):", layer_norm.bias.shape)
In the Transformer architecture, Layer Normalization and residual connections are typically applied together around each sublayer (both the multi-head attention and the position-wise feed-forward network). The standard structure described in the original "Attention Is All You Need" paper applies the normalization after the residual addition (Post-LN):
Output = LayerNorm(x + SubLayer(x))
However, subsequent research and practice have often found that applying Layer Normalization before the sublayer within the residual branch (Pre-LN) can lead to more stable training, especially for very deep Transformers:
Output = x + SubLayer(LayerNorm(x))
We will explore the implications of Pre-LN vs. Post-LN further in Chapter 11 when discussing scaling laws and architectural choices. For now, recognize that the combination, often depicted as an "Add & Norm" step, is fundamental.
Comparison of Post-LN (normalization after addition) and Pre-LN (normalization before the sublayer) structures within a residual block.
Here's how a Transformer Encoder layer often combines these using the Pre-LN approach:
import torch
import torch.nn as nn

# Assume MultiHeadAttention and PositionwiseFeedForward are defined classes
# from previous sections / external modules

class EncoderLayer(nn.Module):
    """
    Implements one Transformer Encoder layer with Pre-LN.
    """
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)        # Placeholder
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)     # Placeholder
        # Layer Normalization instances
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout for regularization
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # 1. Layer Normalization followed by Multi-Head Self-Attention
        norm_x = self.norm1(x)
        # Assume self_attn returns the attention output
        attn_output = self.self_attn(norm_x, norm_x, norm_x, mask)
        # Residual connection with dropout
        x = x + self.dropout1(attn_output)

        # 2. Layer Normalization followed by Position-wise Feed-Forward
        norm_x = self.norm2(x)
        ff_output = self.feed_forward(norm_x)
        # Residual connection with dropout
        x = x + self.dropout2(ff_output)

        return x
# Example Usage
# d_model = 512
# num_heads = 8
# d_ff = 2048 # Feed-forward inner dimension
# batch_size = 4
# seq_len = 10
# input_tensor = torch.randn(batch_size, seq_len, d_model)
# encoder_layer = EncoderLayer(d_model, num_heads, d_ff)
# output_tensor = encoder_layer(input_tensor)
# print("Encoder Layer Output Shape:", output_tensor.shape)
In summary, residual connections facilitate gradient flow and information propagation through deep networks, while layer normalization stabilizes activation distributions. Their combined use is a critical factor enabling the successful training of deep Transformer models, forming the backbone of the encoder and decoder layers.