We've examined the internal machinery of individual encoder and decoder layers. A single layer, however, typically performs only a limited transformation on its input representations. The true strength of the Transformer architecture arises from composing these layers sequentially, creating deep stacks for both the encoder and the decoder. The original Transformer model, for instance, utilized a stack of N=6 identical layers for the encoder and another stack of N=6 identical layers for the decoder.
Why Stack Layers?
Stacking layers allows the model to learn increasingly complex representations of the input data hierarchically. Much like how convolutional neural networks build up representations from simple edges to complex objects across layers, stacked Transformer layers progressively refine sequence representations.
- Hierarchical Processing: Initial layers might focus on local context and dependencies within the sequence. Subsequent layers can then integrate information over longer distances, leveraging the refined representations from lower layers to capture more global relationships and abstract features. The multi-head attention mechanism within each layer allows different heads to focus on different aspects, and stacking enables the model to build upon these diverse perspectives layer by layer.
- Increased Model Capacity: Each layer adds computational depth and parametric complexity. With more layers, the model has greater capacity to approximate the intricate functions required for tasks like machine translation or text generation. The sequential application of self-attention, cross-attention (in the decoder), and feed-forward transformations through multiple layers enables a highly non-linear and powerful mapping from input to output sequences.
The Stacking Mechanism
In the standard Transformer architecture, both the encoder and decoder consist of a specified number, N, of identical layers stacked consecutively. Although the layers share the same structure (same sub-layers and dimensionalities), each layer has its own unique set of trainable weights.
- Encoder Stack: The input sequence (token embeddings + positional encodings) is first processed by encoder layer 1. The output of layer 1, which has the same dimensionality as the input, serves as the input to layer 2, and so on, up to layer N. The output tensor from the final encoder layer (layer N) encapsulates a rich representation of the entire input sequence. Crucially, this final encoder output then serves as the Key (K) and Value (V) input to the cross-attention sub-layer in each of the N decoder layers (see the sketch below).
- Decoder Stack: Similarly, the decoder stack processes the target sequence embeddings (plus positional encodings). The output from decoder layer i becomes the input for decoder layer i+1. Each decoder layer performs masked self-attention on the target sequence, followed by cross-attention with the final encoder output, and finally passes the result through a position-wise feed-forward network. The output of the final decoder layer (layer N) is then fed into the final linear transformation and softmax layer to produce output probabilities over the vocabulary.
Data flow through stacked encoder and decoder layers. The final encoder output provides context (Keys and Values) to all decoder layers via the cross-attention mechanism.
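The data flow described above can be made concrete with a short sketch. The example below uses PyTorch's nn.TransformerEncoderLayer and nn.TransformerDecoderLayer as stand-ins for the per-layer blocks discussed earlier; the values of N, d_model, the head count, and the tensor shapes are illustrative choices, not values prescribed by this section.

```python
import torch
import torch.nn as nn

N, d_model, n_heads = 6, 512, 8

# N structurally identical layers per stack, each with its own trainable weights.
enc_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(N)]
)
dec_layers = nn.ModuleList(
    [nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True) for _ in range(N)]
)

src = torch.randn(2, 10, d_model)  # source embeddings + positional encodings: (batch, src_len, d_model)
tgt = torch.randn(2, 7, d_model)   # target embeddings + positional encodings: (batch, tgt_len, d_model)

# Encoder stack: the output of layer i is the input to layer i+1; shapes are preserved.
x = src
for layer in enc_layers:
    x = layer(x)
memory = x  # final encoder output

# Decoder stack: each layer performs masked self-attention on the target, then
# cross-attention against the *final* encoder output (used as K and V).
causal_mask = torch.triu(
    torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1
)
y = tgt
for layer in dec_layers:
    y = layer(y, memory, tgt_mask=causal_mask)

print(y.shape)  # torch.Size([2, 7, 512]) -- fed to the final linear + softmax layer
```

Note that every decoder layer attends to the same final encoder output; intermediate encoder layers are not exposed to the decoder in the standard architecture.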
Enabling Deep Stacks: Residuals and Normalization
Simply stacking layers can lead to training difficulties, particularly the vanishing gradient problem common in deep networks, where gradients become too small to effectively update the weights of earlier layers. The Transformer architecture incorporates two important mechanisms within each layer to mitigate this:
- Residual Connections (Add): Each sub-layer (self-attention, cross-attention, feed-forward) has a residual connection around it. The input to the sub-layer, x, is added to the sub-layer's output, Sublayer(x), after the latter passes through dropout. This creates a direct path, or "shortcut," for the gradient to flow backward through the network, which significantly eases the optimization of deep models by ensuring that gradients are not diminished excessively as they propagate back through many layers. In the original Post-LN formulation, the operation is LayerNorm(x + Dropout(Sublayer(x))), as sketched after this list.
- Layer Normalization (Norm): Applied within the residual connection path (either before the sub-layer in Pre-LN variants, or after the addition in the original Post-LN formulation). Layer Normalization stabilizes the activations within each layer by normalizing the features across the embedding dimension for each position independently. This helps prevent exploding or vanishing activations, reduces sensitivity to initialization, and generally allows for faster and more stable training, especially with deeper stacks.
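To make the two wrapping orders concrete, here is a minimal PyTorch sketch of a residual sub-layer wrapper contrasting Post-LN, LayerNorm(x + Dropout(Sublayer(x))), with the Pre-LN variant, x + Dropout(Sublayer(LayerNorm(x))). It is an illustration of the formulations above, not the exact module of any particular implementation; names such as ResidualBlock and pre_ln are made up for this example.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps an arbitrary sub-layer with dropout, a residual connection, and LayerNorm."""

    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1, pre_ln: bool = False):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.pre_ln = pre_ln

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pre_ln:
            # Pre-LN: normalize first, then add the residual on the un-normalized path.
            return x + self.dropout(self.sublayer(self.norm(x)))
        # Post-LN (original Transformer): add first, then normalize the sum.
        return self.norm(x + self.dropout(self.sublayer(x)))

d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
block = ResidualBlock(d_model, ffn, pre_ln=False)
x = torch.randn(2, 10, d_model)
print(block(x).shape)  # torch.Size([2, 10, 512]); the identity path also carries gradients
```

The identity term x in both branches is what provides the direct gradient "shortcut" described above, independent of how small the sub-layer's own gradients become.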
Without these components, training Transformers with a significant number of layers (e.g., N>2) would be extremely challenging, if not impossible. They ensure that information and gradients can propagate effectively even through dozens of stacked layers.
Impact of Model Depth
Increasing the number of layers, N, directly impacts the model:
- Performance: Generally, deeper models (larger N) achieve better performance on complex sequence tasks, up to a point where diminishing returns or optimization difficulties set in. The optimal depth often depends on the task complexity, the amount of available training data, and the computational budget.
- Computational Cost: Both training and inference time scale approximately linearly with N. Doubling the number of layers roughly doubles the computation required for a forward pass through the encoder and decoder stacks.
- Parameters: The total number of parameters also scales linearly with N, assuming each layer has an identical structure (which is standard); a rough count is sketched below. This increases memory requirements for storing model checkpoints and activations during training.
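As a rough illustration of this linear scaling, the back-of-the-envelope count below tallies only the dominant weight matrices of a standard encoder layer (the four attention projections and the two feed-forward matrices), ignoring biases, LayerNorm parameters, and embeddings. The dimensions are the base settings of the original Transformer (d_model=512, d_ff=2048); treat the totals as approximate.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weight count for one encoder layer (biases and LayerNorm ignored)."""
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O projections
    ffn = 2 * d_model * d_ff       # two linear maps: d_model -> d_ff -> d_model
    return attn + ffn

d_model, d_ff = 512, 2048          # base settings from the original Transformer
per_layer = encoder_layer_params(d_model, d_ff)
for N in (6, 12, 24):
    print(f"N={N:2d}: ~{N * per_layer / 1e6:.1f}M parameters in the encoder stack")
# N= 6: ~18.9M, N=12: ~37.7M, N=24: ~75.5M -- the total grows linearly in N
```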
The choice of N is a fundamental hyperparameter in Transformer design. While the original paper used N=6, modern large language models often employ much deeper stacks (e.g., N=24, 48, 96, or even more). This increase in depth has been enabled by access to vast datasets, significant computational resources, and continuous refinements in architectural details (like the widespread adoption of Pre-LN normalization for improved stability) and training techniques.
In summary, stacking multiple identically structured encoder and decoder layers is how Transformers achieve the depth necessary for high performance. This depth allows for hierarchical processing of sequence information and provides the required model capacity for complex sequence modeling tasks. The successful training of these deep stacks relies heavily on the careful integration of residual connections and layer normalization within each constituent block.