Layer Normalization (LN) is a fundamental component within each Transformer block, applied alongside the residual connections around the self-attention and feed-forward sub-layers. Its primary role is to stabilize the hidden state dynamics during training by normalizing the activations across the feature dimension for each position independently. This helps maintain consistent activation scales, smooths the loss surface, and generally improves gradient flow, making training deeper networks feasible.
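As a quick illustration of what "normalizing across the feature dimension for each position" means, here is a minimal sketch using PyTorch's `nn.LayerNorm`; the tensor sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 4, 8                  # toy sizes for illustration
x = torch.randn(batch, seq_len, d_model) * 5 + 3   # activations with arbitrary scale/offset

layer_norm = nn.LayerNorm(d_model)                 # normalizes over the last (feature) dimension
y = layer_norm(x)

# Each position is normalized independently across its d_model features,
# then scaled and shifted by learned parameters (identity at initialization).
print(y.mean(dim=-1))                  # ~0 for every (batch, position) pair
print(y.std(dim=-1, unbiased=False))   # ~1 for every (batch, position) pair
```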
However, the placement of the Layer Normalization step relative to the residual connection significantly impacts training dynamics and stability. The two predominant strategies are Post-Normalization (Post-LN), used in the original "Attention Is All You Need" paper, and Pre-Normalization (Pre-LN), which has gained popularity due to its enhanced stability. Let's examine each approach.
In the original Transformer architecture, Layer Normalization is applied after the output of a sub-layer (like Multi-Head Attention or the Feed-Forward Network) is added back to the input via the residual connection.
The computation flow for a sub-layer in a Post-LN block is `output = LayerNorm(x + SubLayer(x))`, as illustrated below:
Data flow in a Post-Normalization Transformer block. Normalization occurs after the residual addition.
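A minimal PyTorch sketch of this wiring is shown below. The class name, dimensions, and the use of `nn.MultiheadAttention` plus a two-layer feed-forward network are illustrative choices, not code from the original paper:

```python
import torch
import torch.nn as nn

class PostLNTransformerBlock(nn.Module):
    """Post-LN block: normalize *after* each residual addition."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: LayerNorm(x + SelfAttention(x))
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln1(x + self.dropout(attn_out))
        # Sub-layer 2: LayerNorm(x + FFN(x))
        x = self.ln2(x + self.dropout(self.ffn(x)))
        return x

# Example usage (hypothetical sizes):
block = PostLNTransformerBlock(d_model=512, n_heads=8, d_ff=2048)
out = block(torch.randn(2, 16, 512))   # (batch, seq_len, d_model) -> same shape
```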
Characteristics:

- Normalization is applied to the residual stream itself, so gradients flowing backward must pass through a LayerNorm at every block.
- Training tends to be less stable, especially in deep models, and often requires careful learning rate warmup.
- With extensive tuning, it can sometimes reach slightly better peak results than Pre-LN.
To address the stability issues of Post-LN, the Pre-Normalization approach was proposed. Here, Layer Normalization is applied to the input before it enters the sub-layer module, inside the residual branch, while the skip connection adds the original, unmodified input x to the sub-layer's output.
The computation flow for a sub-layer in a Pre-LN block is `output = x + SubLayer(LayerNorm(x))`, as illustrated below:
Data flow in a Pre-Normalization Transformer block. Normalization occurs before the sub-layer computation.
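The corresponding Pre-LN sketch only moves the LayerNorm calls inside the residual branches; the same illustrative assumptions apply:

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Pre-LN block: normalize the sub-layer input, keep the residual path clean."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: x + SelfAttention(LayerNorm(x))
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)
        # Sub-layer 2: x + FFN(LayerNorm(x))
        x = x + self.dropout(self.ffn(self.ln2(x)))
        return x
```

Note that stacking Pre-LN blocks leaves the final residual stream unnormalized, which is why Pre-LN models typically apply one extra LayerNorm after the last block.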
Characteristics:

- The residual (skip) path carries the unnormalized input straight through, so gradients along it bypass the LayerNorms.
- Training is more stable, less sensitive to learning rate warmup (often working without it), and scales better to very deep models.
- It is widely adopted in modern large models and is generally easier to tune for good, stable results.
| Feature | Post-Normalization (Post-LN) | Pre-Normalization (Pre-LN) |
|---|---|---|
| Placement | `LayerNorm(x + SubLayer(x))` | `x + SubLayer(LayerNorm(x))` |
| Stability | Less stable, especially in deep models | More stable, facilitates deeper model training |
| Warmup | Often requires careful LR warmup | Less sensitive to LR warmup, often trains without it |
| Gradient Flow | Gradients pass through LN after addition | Gradients through residual path bypass LN |
| Original Paper | Yes | No (later improvement) |
| Modern Usage | Less common in very large models | Widely adopted, especially for large models |
| Peak Performance | Can sometimes achieve slightly better peak results with extensive tuning | Generally easier to tune for good, stable results |
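To make the gradient-flow row concrete, the sketch below stacks many toy residual blocks of each kind (a small MLP stands in for the attention/FFN sub-layer) and inspects the gradient norm reaching the first block. The depth, sizes, and sub-layer are made up for illustration, and the exact numbers will vary with initialization; the point is only how one might probe the difference:

```python
import torch
import torch.nn as nn

d_model, depth = 64, 24   # illustrative hidden size and number of stacked blocks

class PostLNBlock(nn.Module):
    """Toy Post-LN residual block: LayerNorm(x + SubLayer(x))."""
    def __init__(self, d):
        super().__init__()
        self.sub = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ln = nn.LayerNorm(d)
    def forward(self, x):
        return self.ln(x + self.sub(x))

class PreLNBlock(nn.Module):
    """Toy Pre-LN residual block: x + SubLayer(LayerNorm(x))."""
    def __init__(self, d):
        super().__init__()
        self.sub = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ln = nn.LayerNorm(d)
    def forward(self, x):
        return x + self.sub(self.ln(x))

def first_layer_grad_norm(block_cls):
    torch.manual_seed(0)
    model = nn.Sequential(*[block_cls(d_model) for _ in range(depth)])
    x = torch.randn(8, 16, d_model)
    loss = model(x).pow(2).mean()      # dummy loss on the final activations
    loss.backward()
    # Gradient norm of the first block's first Linear weight.
    return model[0].sub[0].weight.grad.norm().item()

print("Post-LN first-block grad norm:", first_layer_grad_norm(PostLNBlock))
print("Pre-LN  first-block grad norm:", first_layer_grad_norm(PreLNBlock))
```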
Hypothetical training loss curves. Pre-LN often shows stable convergence. Post-LN without warmup might diverge, while Post-LN with proper warmup can converge well, sometimes achieving slightly lower final loss than Pre-LN but requiring careful tuning.
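For reference, the kind of learning rate warmup that Post-LN commonly relies on can be expressed as a simple linear ramp. The sketch below uses PyTorch's `LambdaLR` scheduler with placeholder model, step count, and base learning rate (real schedules typically add a decay phase after warmup):

```python
import torch

model = torch.nn.Linear(512, 512)       # placeholder standing in for a Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 4000
def lr_lambda(step: int) -> float:
    # Linearly ramp the LR from ~0 to its base value over warmup_steps,
    # then hold it constant (real schedules usually decay afterwards).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(10):                   # training loop stub
    optimizer.step()                     # (loss.backward() would precede this)
    scheduler.step()
```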
While the original Transformer used Post-Normalization, the Pre-Normalization variant offers significant practical advantages in terms of training stability and reduced sensitivity to hyperparameter choices like the learning rate schedule. By normalizing the input before it passes through the complex self-attention and feed-forward layers, Pre-LN ensures a smoother optimization process, particularly critical when scaling Transformers to dozens or even hundreds of layers. For these reasons, Pre-LN is frequently the preferred choice in contemporary Transformer architectures. Understanding both configurations, however, provides valuable insight into the design choices and training dynamics of these powerful models.