As we stack multiple encoder or decoder layers to build deeper Transformer models, ensuring stable and efficient training becomes a significant challenge. The activations flowing through the network can exhibit widely varying distributions across layers and training steps, a phenomenon sometimes related to internal covariate shift. Moreover, deep networks are prone to vanishing or exploding gradients, hindering effective learning. Layer Normalization (LayerNorm) is a technique integrated into the Transformer architecture specifically to mitigate these issues.
Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization operates independently on each data sample (each sequence in a batch) and normalizes across the feature dimension (the embedding dimension, $d_{\text{model}}$). This makes it particularly well-suited for sequence data, where batch statistics may be less representative or where variable sequence lengths complicate Batch Normalization.
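To make the axis distinction concrete, the following PyTorch sketch compares where the statistics are computed (the tensor shapes and variable names here are illustrative assumptions, not part of the original text):

```python
import torch

# A batch of 4 sequences, each 10 tokens long, with d_model = 512 features.
x = torch.randn(4, 10, 512)

# Layer Normalization: statistics over the feature dimension,
# computed independently for every (sequence, position) pair.
layer_mean = x.mean(dim=-1)                     # shape (4, 10): one mean per token
layer_var = x.var(dim=-1, unbiased=False)       # shape (4, 10)

# Batch Normalization (for comparison): statistics over the batch
# (and position) dimensions, one mean/variance per feature.
batch_mean = x.mean(dim=(0, 1))                 # shape (512,): one mean per feature
batch_var = x.var(dim=(0, 1), unbiased=False)   # shape (512,)
```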
For a given input vector $x$ representing the activations at a specific position in a sequence within a layer (typically of dimension $d_{\text{model}}$), Layer Normalization first calculates the mean ($\mu$) and variance ($\sigma^2$) of its elements:
$$\mu = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} x_i \qquad \sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2$$

Next, it normalizes the input vector $x$ using this mean and variance, adding a small epsilon ($\epsilon$, e.g., $10^{-5}$) for numerical stability:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Finally, the normalized output $\hat{x}$ is scaled and shifted using two learnable parameter vectors: a gain (or scale) parameter $\gamma$ and a bias (or shift) parameter $\beta$, both of dimension $d_{\text{model}}$. These parameters are learned during training alongside the other model weights. They allow the network to adaptively determine the optimal scale and location for the normalized activations, potentially even recovering the original activations if needed ($\gamma = \sqrt{\sigma^2 + \epsilon}$, $\beta = \mu$).

$$y_i = \gamma_i \hat{x}_i + \beta_i$$

This entire operation, $\text{LayerNorm}(x) = y$, standardizes the inputs to the subsequent sub-layer (such as multi-head attention or the feed-forward network), helping to stabilize hidden state dynamics and improve gradient flow.
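For reference, the computation above can be written as a compact PyTorch module. This is a minimal sketch that mirrors the equations; in practice, torch.nn.LayerNorm implements the same operation:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer normalization over the last (feature) dimension."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable gain
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable bias
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model). Statistics are computed per position, over features.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

Initializing $\gamma$ to ones and $\beta$ to zeros means the module starts out as a pure standardization; training then adjusts the scale and shift as needed.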
The original Transformer paper ("Attention Is All You Need") placed Layer Normalization after the residual connection, a configuration now known as Post-LN:
$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$
Here, $x$ is the input to the sub-layer (e.g., multi-head attention or the FFN), and $\text{Sublayer}(x)$ is the output of that sub-layer.
Data flow in a Post-LN Transformer sub-layer. Normalization happens after the residual addition.
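In code, a Post-LN residual block is a thin wrapper around any sub-layer. The sketch below is illustrative (the class name is mine, and the dropout the original paper applies to the sub-layer output is omitted for brevity):

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: normalize after adding the residual, as in the original Transformer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # sublayer is a callable such as a multi-head attention or FFN module.
        return self.norm(x + sublayer(x))
```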
While effective, Post-LN can make training difficult, especially in very deep models. Because the normalization sits directly on the residual path, gradients flowing backward must pass through every LayerNorm in the stack and tend to be poorly scaled near the output layers at initialization. This is one reason Post-LN models typically require careful learning rate warmup.
More recent implementations often favor Pre-LN, where Layer Normalization is applied before the sub-layer, within the main branch of the residual connection:
$$\text{output} = x + \text{Sublayer}(\text{LayerNorm}(x))$$
Data flow in a Pre-LN Transformer sub-layer. Normalization happens on the input before the sub-layer transformation.
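The corresponding Pre-LN wrapper changes only where the normalization sits (again an illustrative sketch with dropout omitted):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN: normalize the sub-layer input; the residual path stays unnormalized."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        return x + sublayer(self.norm(x))
```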
Pre-LN tends to stabilize training, often allowing higher learning rates and reducing the need for long learning rate warmup schedules (though warmup is still commonly used). Because the residual path carries the unnormalized input directly, gradients can flow from the output straight back to earlier layers, while each sub-layer still receives normalized inputs. Most modern large language models use the Pre-LN structure.
Regardless of placement, Layer Normalization is applied twice within each encoder layer: once before (Pre-LN) or after (Post-LN) the self-attention mechanism, and once before or after the position-wise feed-forward network. Each decoder layer adds a third LayerNorm for the encoder-decoder cross-attention mechanism.
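Putting this together, a Pre-LN encoder layer contains two LayerNorms, one per sub-layer. The sketch below assumes torch.nn.MultiheadAttention and a simple two-layer FFN; those choices (and the ReLU activation and hyperparameter values) are illustrative, not prescribed by the text:

```python
import torch
import torch.nn as nn

class PreLNEncoderLayer(nn.Module):
    """One encoder layer with two LayerNorms: before self-attention and before the FFN."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer, Pre-LN style.
        h = self.attn_norm(x)
        attn_out, _ = self.self_attn(h, h, h, need_weights=False)
        x = x + attn_out

        # Position-wise feed-forward sub-layer, Pre-LN style.
        x = x + self.ffn(self.ffn_norm(x))
        return x

# Example: a batch of 4 sequences of 10 tokens with d_model = 512.
layer = PreLNEncoderLayer(d_model=512, num_heads=8, d_ff=2048)
out = layer(torch.randn(4, 10, 512))   # shape (4, 10, 512)
```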
In summary, Layer Normalization is a critical component that works in concert with residual connections to enable the training of deep Transformer stacks. By stabilizing activation distributions independently for each sequence position, it smooths the optimization process and contributes significantly to the success of these powerful models.