While Batch Normalization (BN) effectively addresses internal covariate shift by normalizing activations across a mini-batch, its reliance on batch statistics can sometimes be a limitation. For instance, BN's performance can degrade with very small mini-batch sizes, as the batch statistics become noisy estimates of the true population statistics. Furthermore, applying BN directly to recurrent neural networks (RNNs) can be tricky because the statistics need to be computed differently for each time step.
Layer Normalization (LN) offers an alternative approach that overcomes these specific limitations. Instead of normalizing across the batch dimension, Layer Normalization computes the mean and variance across all the hidden units in the same layer for a single training example.
Imagine the activations $a$ for a specific layer $l$ generated from a single input example $x$. Let $H$ be the number of hidden units in that layer. Layer Normalization calculates the mean $\mu$ and variance $\sigma^2$ using all the summed inputs to the neurons in that layer for that single example:
$$\mu = \frac{1}{H}\sum_{i=1}^{H} a_i \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H} (a_i - \mu)^2$$

Note that these calculations are performed independently for each training example and do not involve interactions across the batch.
Once the mean and variance are computed, the activations $a_i$ for each hidden unit $i$ in that layer are normalized:
$$\hat{a}_i = \frac{a_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Here, $\epsilon$ is a small constant added for numerical stability, similar to its use in Batch Normalization.
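To make these two steps concrete, here is a minimal NumPy sketch; the batch size, the number of hidden units, and the random activations are illustrative assumptions, not values from the text.

```python
import numpy as np

# Hypothetical activations for one layer: a batch of 4 examples, H = 6 hidden units.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 6)).astype(np.float32)

# Statistics are computed over the hidden units of each example (the last axis),
# never over the batch axis.
mu = a.mean(axis=-1, keepdims=True)      # shape (4, 1): one mean per example
var = a.var(axis=-1, keepdims=True)      # shape (4, 1): one variance per example

eps = 1e-5                               # small constant for numerical stability
a_hat = (a - mu) / np.sqrt(var + eps)    # normalized activations, shape (4, 6)

# Each row now has approximately zero mean and unit variance.
print(a_hat.mean(axis=-1))
print(a_hat.var(axis=-1))
```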
Finally, just like Batch Normalization, Layer Normalization introduces learnable parameters: a per-neuron scale factor $\gamma$ (gamma) and a shift factor $\beta$ (beta). These parameters allow the network to learn the optimal scale and mean of the normalized activations, potentially even recovering the original activations if needed:
$$\text{LN}(a_i) = \gamma_i \hat{a}_i + \beta_i$$

These parameters $\gamma$ and $\beta$ are learned during training along with the network's other weights.
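Putting the statistics, normalization, and scale-and-shift steps together, the full forward pass can be sketched as a small self-contained function. The function name `layer_norm`, the shapes, and the initial values of $\gamma$ and $\beta$ are illustrative choices, not a reference implementation.

```python
import numpy as np

def layer_norm(a, gamma, beta, eps=1e-5):
    """Apply Layer Normalization to activations of shape (batch, H)."""
    mu = a.mean(axis=-1, keepdims=True)    # per-example mean over hidden units
    var = a.var(axis=-1, keepdims=True)    # per-example variance over hidden units
    a_hat = (a - mu) / np.sqrt(var + eps)  # normalize
    return gamma * a_hat + beta            # per-neuron scale and shift

H = 6
gamma = np.ones(H, dtype=np.float32)       # typically initialized to 1, learned during training
beta = np.zeros(H, dtype=np.float32)       # typically initialized to 0, learned during training

a = np.random.default_rng(1).normal(size=(4, H)).astype(np.float32)
out = layer_norm(a, gamma, beta)           # shape (4, 6)
```

With $\gamma = 1$ and $\beta = 0$ the output equals the normalized activations; in a real framework these parameters would be trainable tensors updated by the optimizer.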
The primary distinction lies in the normalization axis:

- Batch Normalization computes the mean and variance of each hidden unit (feature) across all the examples in the mini-batch.
- Layer Normalization computes the mean and variance of each individual example across all the hidden units in the layer.
This difference leads to several important properties of Layer Normalization, one of which is demonstrated in the sketch after this list:

- Its output for an example does not depend on the other examples in the batch, so it behaves identically for any batch size, including a batch size of one.
- It performs exactly the same computation at training and test time; there are no running statistics to maintain.
- It applies naturally to RNNs, since the same normalization can be used at every time step.
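As a quick check of the batch-size property, the following sketch (again with illustrative shapes and values) contrasts the axes the two methods use for their statistics and shows that the Layer Normalization output for an example does not change when the rest of the batch changes.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=(4, 6)).astype(np.float32)  # batch of 4 examples, 6 hidden units
eps = 1e-5

def ln(x):
    mu = x.mean(axis=-1, keepdims=True)         # statistics over hidden units, per example
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Batch Normalization would instead take its statistics over the batch axis:
bn_mu = a.mean(axis=0)                          # one mean per hidden unit
bn_var = a.var(axis=0)                          # one variance per hidden unit
print(bn_mu.shape, bn_var.shape)                # (6,) (6,)

# The LN result for the first example is the same whether it is normalized
# alone or as part of the batch: no dependence on the other examples.
alone = ln(a[:1])
in_batch = ln(a)[:1]
print(np.allclose(alone, in_batch))             # True
```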
The following diagram illustrates the different normalization dimensions:
Comparison of normalization dimensions for Batch Normalization (blue, column-wise across batch) and Layer Normalization (green, row-wise across features).
While BN is often the default choice for Convolutional Neural Networks (CNNs), LN has proven particularly useful in NLP tasks involving transformers and RNNs, and in situations where batch statistics might be unreliable. It serves as another valuable tool for stabilizing and potentially accelerating the training of deep neural networks. We won't delve into the implementation details here as deeply as we did for Batch Normalization, but most deep learning frameworks provide simple ways to add Layer Normalization layers.
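For instance, in PyTorch a Layer Normalization layer can be added with `torch.nn.LayerNorm`; the layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden_size = 256                      # illustrative hidden width

# A small block: a linear layer followed by Layer Normalization.
model = nn.Sequential(
    nn.Linear(128, hidden_size),
    nn.LayerNorm(hidden_size),         # normalizes over the last dimension
    nn.ReLU(),
)

x = torch.randn(4, 128)                # batch of 4 examples
y = model(x)                           # shape (4, 256)
```

Keras offers a similar `tf.keras.layers.LayerNormalization` layer.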