The encoder stack is composed of multiple identical layers, often denoted as N layers (e.g., N=6 in the original Transformer paper). Each encoder layer has the primary responsibility of transforming a sequence of input embeddings into a sequence of contextualized representations. These representations incorporate information from the entire input sequence, allowing the model to understand the role of each element within its context.
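As a quick illustration of this stacking idea, the sketch below uses PyTorch's built-in encoder modules. The hyperparameters (d_model=512, 8 heads, N=6 layers, inner FFN dimension 2048) mirror the original paper; the tensor sizes are arbitrary placeholders chosen for the example.

```python
import torch
import torch.nn as nn

# One encoder layer: self-attention + feed-forward, each wrapped in Add & Norm.
layer = nn.TransformerEncoderLayer(
    d_model=512,           # embedding dimension of each position's vector
    nhead=8,               # number of attention heads
    dim_feedforward=2048,  # inner dimension of the feed-forward network
    dropout=0.1,
    batch_first=True,      # inputs shaped (batch, seq_len, d_model)
)

# The encoder stack: N = 6 identical layers applied one after another.
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, 512)  # e.g. token embeddings + positional encodings
contextualized = encoder(x)  # same shape as x: (2, 10, 512)
print(contextualized.shape)
```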
An individual encoder layer is built around two main sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network.
Crucially, each of these sub-layers is wrapped within a residual connection followed by layer normalization. This "Add & Norm" pattern is a characteristic feature of the Transformer architecture and is fundamental for training deep models effectively. Let's examine the structure and data flow within a single encoder layer.
Assume the input to the encoder layer is a sequence of vectors $X = (x_1, x_2, \ldots, x_n)$, where $n$ is the sequence length and each $x_i$ is a vector (e.g., the sum of token embeddings and positional encodings for the first layer, or the output of the previous encoder layer).
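Concretely, the input to the first layer could be formed as below. The batch size, sequence length, and random positional encodings are placeholder assumptions for illustration only.

```python
import torch

batch_size, seq_len, d_model = 2, 10, 512  # illustrative sizes

token_embeddings = torch.randn(batch_size, seq_len, d_model)
positional_encodings = torch.randn(batch_size, seq_len, d_model)  # stand-in values

# X: one d_model-dimensional vector per position (x_1, ..., x_n).
x = token_embeddings + positional_encodings
print(x.shape)  # torch.Size([2, 10, 512])
```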
Multi-Head Self-Attention: The input sequence $X$ first passes through the multi-head self-attention sub-layer. As detailed in Chapter 3, this mechanism allows each position $i$ to attend to all positions (including itself) in the sequence $X$. It computes attention scores from queries, keys, and values derived from $X$ itself, producing an output sequence in which each vector is a weighted sum of value vectors that reflects contextual information. Let the output of this sub-layer be denoted as $\text{MultiHeadAttention}(X)$.
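A minimal sketch of this sub-layer using PyTorch's nn.MultiheadAttention, where the same tensor supplies the queries, keys, and values; the dimensions below are assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)  # the layer input X: (batch, seq_len, d_model)

# Self-attention: queries, keys, and values are all derived from X itself.
attn_output, attn_weights = self_attn(query=x, key=x, value=x)
print(attn_output.shape)  # torch.Size([2, 10, 512]) -- MultiHeadAttention(X)
```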
Add & Norm (First Block): The output of the self-attention sub-layer is then combined with the original input $X$ through a residual connection (addition). This helps mitigate the vanishing gradient problem in deep networks by allowing gradients to flow directly through the network. Dropout is typically applied to the output of the self-attention sub-layer before the addition step for regularization. Following the addition, layer normalization is applied, which stabilizes the activations and improves training dynamics. The operation for this block can be represented as:
$$\text{SubLayerOutput}_1 = \text{LayerNorm}(X + \text{Dropout}(\text{MultiHeadAttention}(X)))$$

The result, $\text{SubLayerOutput}_1$, is an intermediate sequence of representations of the same dimension as $X$.
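In code, this first Add & Norm block amounts to a single line. Here attn_output is a random stand-in for MultiHeadAttention(X), and the dropout rate of 0.1 is an assumption.

```python
import torch
import torch.nn as nn

d_model = 512
norm1 = nn.LayerNorm(d_model)
dropout = nn.Dropout(p=0.1)

x = torch.randn(2, 10, d_model)            # the layer input X
attn_output = torch.randn(2, 10, d_model)  # stand-in for MultiHeadAttention(X)

# Post-LN ordering: dropout, residual addition, then layer normalization.
sub_layer_output_1 = norm1(x + dropout(attn_output))
print(sub_layer_output_1.shape)  # same shape as X: torch.Size([2, 10, 512])
```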
Position-Wise Feed-Forward Network (FFN): This intermediate sequence $\text{SubLayerOutput}_1$ is then fed into the position-wise feed-forward network. This network consists of two linear transformations with an activation function in between (typically ReLU or GELU). Importantly, the same FFN (with the same weights) is applied independently to each position $i$ in the sequence $\text{SubLayerOutput}_1$. It provides a non-linear transformation, further processing the representations. Let the output be denoted as $\text{FFN}(\text{SubLayerOutput}_1)$. The structure is often:
$$\text{FFN}(z) = \max(0, zW_1 + b_1)W_2 + b_2$$

where $z$ is the input vector for a specific position, and $W_1, b_1, W_2, b_2$ are the learnable parameters of the two linear layers. The inner dimension is typically larger than the model's embedding dimension $d_{\text{model}}$ (e.g., 2048 versus $d_{\text{model}} = 512$ in the original Transformer).
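A sketch of the position-wise FFN as two linear layers with a ReLU in between; the inner dimension of 2048 follows the original paper's choice and the other sizes are assumptions.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048  # d_ff is the larger inner dimension

# Applied independently to each position: z W1 + b1 -> ReLU -> (.) W2 + b2.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

z = torch.randn(2, 10, d_model)  # SubLayerOutput1
print(ffn(z).shape)              # torch.Size([2, 10, 512])
```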
Add & Norm (Second Block): As with the first sub-layer, dropout is applied to the FFN output, which is then combined with the sub-layer's input ($\text{SubLayerOutput}_1$) via a residual connection and layer normalized.
$$\text{LayerOutput} = \text{LayerNorm}(\text{SubLayerOutput}_1 + \text{Dropout}(\text{FFN}(\text{SubLayerOutput}_1)))$$

The $\text{LayerOutput}$ is the final output sequence of vectors for this encoder layer. This output has the same dimensions as the input $X$ and serves as the input to the next identical encoder layer in the stack.
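Putting the pieces together, a single Post-LN encoder layer could be sketched as the module below. The class and parameter names are illustrative, not a reference implementation.

```python
import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    """Sketch of one Post-LN Transformer encoder layer."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention, then Add & Norm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise FFN, then Add & Norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x  # same shape as the input

layer = PostLNEncoderLayer()
print(layer(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```

Stacking N of these modules, each feeding its output to the next, yields the full encoder.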
The following diagram illustrates the structure of a single encoder layer following this description (often referred to as Post-LN, where normalization happens after the addition):
Structure of a standard Transformer encoder layer (Post-LN variant). Input flows through Multi-Head Attention, is added to the residual input, and normalized. This result then flows through the Feed-Forward Network, is again added to its residual input (the output of the first Norm), and normalized again to produce the layer's output. Dropout is applied before each addition.
It's worth noting a common architectural modification known as the Pre-LN Transformer. In this variant, the layer normalization step is applied before the input enters each sub-layer (self-attention and FFN), while the residual connection adds the sub-layer's output directly to its input.
The flow for Pre-LN would look like this:

$$\text{SubLayerOutput}_1 = X + \text{Dropout}(\text{MultiHeadAttention}(\text{LayerNorm}(X)))$$

$$\text{LayerOutput} = \text{SubLayerOutput}_1 + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\text{SubLayerOutput}_1)))$$
Pre-LN often leads to more stable training, especially for very deep Transformers, and may require less careful learning rate warmup compared to the original Post-LN structure. Understanding both configurations is valuable, as implementations and research papers might use either variant.
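The difference is confined to the forward pass: normalization is applied to each sub-layer's input, and the residual connection adds the unnormalized tensor. Below is a sketch using the same illustrative names as the Post-LN module above.

```python
import torch
import torch.nn as nn

class PreLNEncoderLayer(nn.Module):
    """Sketch of one Pre-LN encoder layer: normalize first, add the residual after."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # LayerNorm is applied to the sub-layer input; the residual adds the raw x.
        normed = self.norm1(x)
        attn_out, _ = self.self_attn(normed, normed, normed)
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x

print(PreLNEncoderLayer()(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```

In recent PyTorch versions, the built-in nn.TransformerEncoderLayer exposes the same choice through its norm_first argument.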
The output of the final encoder layer in the stack (N-th layer) serves as the key (K) and value (V) inputs for the cross-attention mechanism in each layer of the decoder stack, which we will discuss next.
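As a preview of that step, the sketch below feeds the final encoder output as keys and values to a cross-attention module while the decoder side supplies the queries; all tensors here are random placeholders.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_output = torch.randn(2, 10, d_model)  # output of the final (N-th) encoder layer
decoder_hidden = torch.randn(2, 7, d_model)   # decoder-side representations (7 target positions)

# Cross-attention: queries come from the decoder, keys and values from the encoder output.
out, _ = cross_attn(query=decoder_hidden, key=encoder_output, value=encoder_output)
print(out.shape)  # torch.Size([2, 7, 512]) -- one vector per decoder position
```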