Initializing the weights of a neural network correctly is a fundamental step towards stable and efficient training. For deep architectures like Transformers, where gradients propagate through numerous layers, poor initialization can easily lead to vanishing or exploding gradients, stalling the learning process before it even begins. While modern optimizers and normalization layers alleviate some of these issues, thoughtful weight initialization remains a significant part of the practical implementation puzzle.
The goal of most standard initialization schemes is to maintain the variance of activations and gradients as they propagate forward and backward through the network. If the variance increases exponentially with each layer, gradients explode; if it decreases exponentially, they vanish.
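To make this concrete, the short sketch below (with illustrative depth and width) pushes random activations through a stack of plain linear maps and prints how their standard deviation evolves for a few weight scales; only a scale near 1/√fan_in keeps it roughly constant.

```python
import torch

# Illustrative sizes: 50 stacked linear maps of width 512.
depth, width = 50, 512
x = torch.randn(1024, width)

# Too small, roughly variance-preserving (1/sqrt(width)), and too large.
for std in (0.02, 1.0 / width ** 0.5, 0.1):
    h = x
    for _ in range(depth):
        w = torch.randn(width, width) * std
        h = h @ w  # linear map only; no nonlinearity, to isolate the scaling effect
    print(f"weight std {std:.4f}: activation std after {depth} layers = {h.std().item():.3e}")
```

The too-small scale drives activations toward zero, the too-large scale blows them up, and the 1/√fan_in scale keeps them near their original magnitude, which is exactly the behavior the initialization schemes below are designed to produce.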
Transformers heavily rely on linear (or dense) layers within their Multi-Head Attention modules (for Q, K, V projections and the final output projection) and the Position-wise Feed-Forward Networks (FFNs). The most common and effective initialization strategies for these layers are Glorot (Xavier) and He initialization.
Glorot (Xavier) Initialization: Proposed by Glorot and Bengio (2010), this method aims to keep the variance of activations and gradients constant across layers. It's particularly effective for layers with symmetric activation functions like tanh. For a linear layer transforming an input of size $\text{fan}_{in}$ to an output of size $\text{fan}_{out}$, the weights $W$ are typically drawn from a uniform distribution:

$$W \sim \mathcal{U}\left[-\sqrt{\frac{6}{\text{fan}_{in} + \text{fan}_{out}}},\; \sqrt{\frac{6}{\text{fan}_{in} + \text{fan}_{out}}}\right]$$

Alternatively, a normal distribution can be used:

$$W \sim \mathcal{N}\left(0,\; \frac{2}{\text{fan}_{in} + \text{fan}_{out}}\right)$$

He Initialization: Proposed by He et al. (2015), this method is specifically designed for layers followed by Rectified Linear Units (ReLU) or its variants, which are common in Transformer FFNs. It accounts for the fact that ReLU zeros out half of the activations, reducing the variance. Weights are typically drawn from a normal distribution:

$$W \sim \mathcal{N}\left(0,\; \frac{2}{\text{fan}_{in}}\right)$$

A uniform version also exists.
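In PyTorch, these schemes map directly onto the built-in initializers; the layer dimensions below are illustrative:

```python
import torch.nn as nn

linear = nn.Linear(512, 2048)

# Glorot / Xavier uniform: bounds derived from 6 / (fan_in + fan_out)
nn.init.xavier_uniform_(linear.weight)

# He / Kaiming normal: variance 2 / fan_in, intended for ReLU layers
nn.init.kaiming_normal_(linear.weight, nonlinearity='relu')

# Normal and uniform variants of both schemes are also available
nn.init.xavier_normal_(linear.weight)
nn.init.zeros_(linear.bias)
```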
In practice, many deep learning frameworks default to He initialization for linear layers when ReLU activations are anticipated, and Glorot otherwise. For Transformers, using He initialization for the FFN layers and Glorot/Xavier for the attention projection layers (which don't typically have an immediate ReLU applied to their direct output before further combination) is a reasonable starting point. Implementations like the original "Attention Is All You Need" paper used Glorot uniform initialization.
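A sketch of that policy applied to a hypothetical Transformer block follows; the attribute names (q_proj, ffn_in, and so on) are assumptions for illustration, not any specific library's API:

```python
import torch.nn as nn

def init_transformer_block(block):
    """Glorot for attention projections, He for the FFN layer that feeds a ReLU."""
    # Attention projections: Glorot / Xavier uniform
    for proj in (block.q_proj, block.k_proj, block.v_proj, block.out_proj):
        nn.init.xavier_uniform_(proj.weight)
        nn.init.zeros_(proj.bias)

    # First FFN layer feeds a ReLU: He / Kaiming
    nn.init.kaiming_normal_(block.ffn_in.weight, nonlinearity='relu')
    nn.init.zeros_(block.ffn_in.bias)

    # Second FFN layer has no ReLU on its output: Glorot is a reasonable choice
    nn.init.xavier_uniform_(block.ffn_out.weight)
    nn.init.zeros_(block.ffn_out.bias)
```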
Input and output embedding layers map discrete token IDs to dense vectors. These are typically initialized differently from standard linear layers. A common practice is to initialize embedding weights from a normal distribution with a mean of 0 and a relatively small standard deviation, such as $\mathcal{N}(0, 0.02^2)$. This prevents the initial embeddings from having excessively large magnitudes, which could destabilize the subsequent computations, especially the addition of positional encodings.
Some implementations might tie the input and output embedding weights, sharing the same matrix (potentially transposed) for both, which also influences how initialization might be approached.
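A minimal sketch of both practices, using illustrative vocabulary and model sizes:

```python
import torch.nn as nn

vocab_size, d_model = 32000, 512  # illustrative sizes

# Embedding initialized from N(0, 0.02^2)
embedding = nn.Embedding(vocab_size, d_model)
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)

# Output projection back to vocabulary logits
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Optional weight tying: share the embedding matrix with the output projection.
# nn.Linear stores its weight as (out_features, in_features) = (vocab_size, d_model),
# so the shapes already match the embedding table without an explicit transpose.
lm_head.weight = embedding.weight
```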
Layer Normalization layers also have trainable parameters: a gain (γ) and a bias (β). These are typically initialized to γ=1 and β=0. This ensures that initially, the Layer Normalization simply normalizes the activations to have zero mean and unit variance without applying any scaling or shifting, allowing the network to learn the appropriate transformations during training.
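In PyTorch this corresponds to the following, which is already the framework's default for nn.LayerNorm:

```python
import torch.nn as nn

layer_norm = nn.LayerNorm(512)  # illustrative feature size

# Explicitly set the gain (weight) to 1 and the bias to 0
nn.init.ones_(layer_norm.weight)
nn.init.zeros_(layer_norm.bias)
```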
Most modern deep learning libraries (PyTorch, TensorFlow, JAX) provide sensible default initializations for their standard layers, often aligning with Glorot or He methods. The Hugging Face Transformers library, for instance, generally initializes weights using a normal distribution with a standard deviation specified by an initializer_range
configuration parameter (often defaulting to 0.02), and initializes biases to zero, with LayerNorm weights set to 1 and biases to 0.
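A sketch of this pattern, written as a function that can be applied recursively to a model's modules; it resembles the approach used in Hugging Face Transformers but is not that library's exact code:

```python
import torch.nn as nn

def init_weights(module, initializer_range=0.02):
    """Normal-distributed weights for Linear and Embedding, zeros for biases,
    and the standard (1, 0) initialization for LayerNorm parameters."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=initializer_range)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=initializer_range)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Usage: model.apply(init_weights) visits every submodule recursively.
```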
Comparison of weight distributions resulting from initialization with a small standard deviation normal distribution versus Xavier uniform initialization for a layer with 512 input/output units. Note the tighter clustering around zero for the N(0, 0.02^2) method compared to the wider spread of Xavier.
While these defaults often work well, understanding the underlying principles allows for more informed debugging if training becomes unstable. Occasionally it can be worth experimenting with small adjustments to the initialization standard deviation, particularly for embedding layers or the final output layer, but deviating drastically from established methods like Glorot or He for the core linear layers is usually unnecessary and potentially counterproductive.