While the encoder processes the input sequence into a rich representation, the decoder's role is typically to generate an output sequence, one element at a time (autoregressively). To achieve this, each decoder layer incorporates information from both the previously generated output elements and the final representation produced by the encoder stack.
A single decoder layer is composed of three distinct sub-layers, each followed by a residual connection and layer normalization step, mirroring the structure seen in the encoder but with significant differences in the attention mechanisms.
The components within a standard Transformer decoder layer are:
Masked Multi-Head Self-Attention: This sub-layer allows each position in the decoder input (the partially generated output sequence) to attend to all positions up to and including itself. The key addition here is the masking. When generating a sequence, the decoder must rely only on previously generated tokens, both during training and at inference time. Masking ensures that the self-attention mechanism cannot "look ahead" at future tokens in the output sequence, preserving the autoregressive property needed for step-by-step generation. This is typically achieved by adding a large negative value (approaching negative infinity) to the attention scores for future positions before the softmax operation, which drives their attention probabilities to zero. The underlying multi-head attention computation is otherwise identical to that described in Chapter 3.
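As an illustration of the masking step, the sketch below (in PyTorch, with made-up tensor sizes and variable names) computes raw attention scores, fills every future position with negative infinity, and applies the softmax; it is a minimal single-head example rather than code from any specific library.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)   # queries from the decoder input
k = torch.randn(seq_len, d_k)   # keys from the decoder input
v = torch.randn(seq_len, d_k)   # values from the decoder input

# Raw scaled dot-product attention scores: (seq_len, seq_len)
scores = q @ k.T / d_k ** 0.5

# Causal mask: True above the diagonal marks "future" positions.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Add a large negative value so future positions get zero probability after softmax.
scores = scores.masked_fill(future, float("-inf"))

weights = F.softmax(scores, dim=-1)   # each row sums to 1 over positions <= i
output = weights @ v                  # masked self-attention output
print(weights)                        # the upper triangle is exactly 0
```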
Multi-Head Cross-Attention (Encoder-Decoder Attention): This is where the decoder integrates information from the input sequence. Unlike self-attention where Queries (Q), Keys (K), and Values (V) all derive from the same sequence, cross-attention operates differently. The Queries (Q) are derived from the output of the previous decoder sub-layer (the masked self-attention layer). However, the Keys (K) and Values (V) come directly from the output of the final layer of the encoder stack. This mechanism enables each position in the decoder's output sequence to attend to all positions in the input sequence, allowing it to focus on the most relevant parts of the input context when generating the next output token. This sub-layer is fundamental for tasks like machine translation where alignment between input and output words is needed.
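To make the different sources of Q, K, and V concrete, here is a minimal single-head sketch of cross-attention; the names dec_hidden and enc_output and the dimensions are hypothetical placeholders chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 16
dec_len, enc_len = 3, 5                     # decoder steps generated so far, input length

dec_hidden = torch.randn(dec_len, d_model)  # output of the masked self-attention sub-layer
enc_output = torch.randn(enc_len, d_model)  # final output of the encoder stack

# Separate projections for Q, K, V (single head for clarity).
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

q = w_q(dec_hidden)   # Queries from the decoder -> (dec_len, d_model)
k = w_k(enc_output)   # Keys from the encoder    -> (enc_len, d_model)
v = w_v(enc_output)   # Values from the encoder  -> (enc_len, d_model)

scores = q @ k.T / d_model ** 0.5    # (dec_len, enc_len): no causal mask needed here
weights = F.softmax(scores, dim=-1)  # each decoder position attends over the whole input
context = weights @ v                # (dec_len, d_model)
print(context.shape)
```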
Position-wise Feed-Forward Network (FFN): Identical in structure and function to the FFN found in the encoder layer, this sub-layer consists of two linear transformations with a non-linear activation function (commonly ReLU or GeLU) in between. It is applied independently to each position vector emerging from the cross-attention sub-layer. This provides additional modeling capacity and allows the network to process the information gathered from the attention mechanisms more effectively.
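A minimal position-wise FFN might look like the following sketch; the hidden width of four times d_model mirrors the common convention from the original Transformer, ReLU is used as the activation, and the exact sizes are illustrative.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear transformations with a non-linearity, applied independently at each position."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights are applied at every position.
        return self.net(x)

ffn = PositionwiseFFN(d_model=16, d_ff=64)
print(ffn(torch.randn(2, 5, 16)).shape)  # torch.Size([2, 5, 16])
```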
Each of these three sub-layers has a residual connection around it, followed by layer normalization. The formula for a sub-layer output can be represented as:
LayerNorm(x + Sublayer(x))

where x is the input to the sub-layer and Sublayer(x) is the function implemented by the sub-layer itself (e.g., masked self-attention, cross-attention, or FFN). These Add & Norm steps are essential for training deep Transformer models by improving gradient flow and stabilizing layer inputs.
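Expressed in code, this post-norm arrangement is a thin wrapper around any shape-preserving sub-layer; the sketch below assumes a generic callable standing in for Sublayer and uses standard layer normalization.

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        return self.norm(x + sublayer(x))

add_norm = AddAndNorm(d_model=16)
x = torch.randn(2, 5, 16)
out = add_norm(x, nn.Linear(16, 16))  # any sub-layer that preserves the shape works here
print(out.shape)                      # torch.Size([2, 5, 16])
```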
The following diagram illustrates the data flow within a single decoder layer:
Data flow within a standard Transformer decoder layer. Input represents the target sequence embeddings (plus positional encoding) or the output from the previous decoder layer. Encoder Output provides Keys and Values for cross-attention. Each sub-layer (Masked Self-Attention, Cross-Attention, FFN) is followed by Add & Norm.
Understanding this layered structure is fundamental. The masked self-attention processes the generated sequence so far, the cross-attention integrates context from the input sequence via the encoder, and the feed-forward network provides further transformation. Multiple identical decoder layers are then stacked to form the complete decoder component of the Transformer. Subsequent sections will examine the specifics of masked attention, cross-attention, FFNs, and the Add & Norm operations in greater detail.
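As a rough end-to-end sketch of how these pieces fit together, the simplified post-norm decoder layer below uses PyTorch's built-in multi-head attention and omits dropout and other training details; it is an illustrative assembly of the ideas above, not a faithful reproduction of any particular library's decoder.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, cross-attention, and FFN, each with Add & Norm."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        # x: (batch, tgt_len, d_model), enc_out: (batch, src_len, d_model)
        tgt_len = x.size(1)
        causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=x.device), 1)

        # 1. Masked self-attention over the target sequence, then Add & Norm.
        sa, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + sa)

        # 2. Cross-attention: Q from the decoder, K and V from the encoder output, then Add & Norm.
        ca, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + ca)

        # 3. Position-wise feed-forward network, then Add & Norm.
        x = self.norm3(x + self.ffn(x))
        return x

# Stack several identical layers to form the complete decoder component.
layers = nn.ModuleList([DecoderLayer(d_model=64, num_heads=8, d_ff=256) for _ in range(6)])
x, enc_out = torch.randn(2, 7, 64), torch.randn(2, 11, 64)
for layer in layers:
    x = layer(x, enc_out)
print(x.shape)  # torch.Size([2, 7, 64])
```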