While the encoder's job is to process the entire input sequence and build informative representations, the decoder's role is to generate the output sequence, one element at a time. This generation process is typically auto-regressive, meaning that when predicting the next element (like a word in a translation), the decoder considers the encoder's output and the elements it has already generated.
The decoder is also composed of a stack of N identical layers. The structure of each decoder layer is slightly more complex than an encoder layer because it needs to handle two types of attention mechanisms.
Let's break down the components within one decoder layer. The input to the first decoder layer is the target sequence embeddings (shifted right, as we'll discuss) plus positional encodings. The inputs to subsequent layers are the outputs from the preceding decoder layer. Crucially, each decoder layer also receives the output of the final encoder layer, which is used to form the Key (K) and Value (V) matrices for its encoder-decoder attention sub-layer.
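As a concrete illustration of the right shift used during training with teacher forcing, here is a minimal sketch; the token IDs and the `BOS_ID` start-of-sequence marker are hypothetical:

```python
import torch

# Hypothetical token IDs for one target sentence, e.g. "ich bin ein Student <eos>"
target_ids = torch.tensor([[17, 52, 8, 431, 2]])   # shape: (batch=1, seq_len=5)
BOS_ID = 1                                         # assumed start-of-sequence token ID

# "Shift right": prepend <bos> and drop the last token.
# The decoder input at position i is the target token at position i-1,
# so the prediction at position i is conditioned only on tokens before i.
decoder_input = torch.cat(
    [torch.full((target_ids.size(0), 1), BOS_ID, dtype=target_ids.dtype),
     target_ids[:, :-1]],
    dim=1,
)
# decoder_input: [[1, 17, 52, 8, 431]]
# labels:        [[17, 52, 8, 431, 2]]  (what the decoder is trained to predict)
```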
A single decoder layer consists of three main sub-layers, each followed by an "Add & Norm" step (residual connection plus layer normalization):
Masked Multi-Head Self-Attention: This sub-layer operates similarly to the self-attention mechanism in the encoder, but with a significant modification: masking. During training, the decoder receives the complete target sequence as input (specifically, a version shifted one position to the right, so the prediction for position i relies only on outputs up to i−1). However, to maintain the auto-regressive property and prevent the model from "cheating" by looking ahead at future tokens it's supposed to predict, we apply an attention mask. This mask sets the attention scores for future positions to negative infinity before the softmax calculation, ensuring that the attention probability for those tokens becomes zero. As a result, each position in the decoder can attend only to itself and to earlier positions in the output sequence. The Query (Q), Key (K), and Value (V) vectors for this sub-layer are derived from the output of the previous decoder layer (or the target sequence embeddings for the first layer).
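Here is a minimal sketch of how such a causal mask can be built and applied to raw attention scores; names like `scores` and `seq_len` are illustrative, not part of any library API:

```python
import torch
import torch.nn.functional as F

seq_len = 5
# Raw attention scores for one head: (seq_len, seq_len), e.g. Q @ K^T / sqrt(d_k)
scores = torch.randn(seq_len, seq_len)

# Positions strictly above the diagonal correspond to "future" tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Set future positions to -inf so softmax assigns them zero probability.
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
attn_weights = F.softmax(masked_scores, dim=-1)

# Each row i now attends only to positions 0..i (past and current).
print(attn_weights)
```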
Multi-Head Encoder-Decoder Attention (Cross-Attention): This is where the decoder interacts with the output of the encoder stack. This sub-layer allows the decoder to focus on relevant parts of the input sequence to help predict the next output token. Here's how it works: the Query (Q) vectors come from the output of the preceding masked self-attention sub-layer, while the Key (K) and Value (V) vectors are computed from the final encoder output. Each decoder position can therefore attend over every position in the input sequence, letting the model align each output token with the input tokens most relevant to it.
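A sketch of this pattern using PyTorch's built-in `nn.MultiheadAttention`; the tensor shapes and dimension sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
batch, src_len, tgt_len = 2, 10, 7     # illustrative sizes

cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

decoder_hidden = torch.randn(batch, tgt_len, d_model)   # output of masked self-attention
encoder_output = torch.randn(batch, src_len, d_model)   # final encoder layer output

# Queries come from the decoder; Keys and Values come from the encoder output.
context, attn_weights = cross_attn(
    query=decoder_hidden, key=encoder_output, value=encoder_output
)
# context: (batch, tgt_len, d_model) - each target position is a weighted
# combination of encoder representations, i.e. the input positions it attends to.
```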
Position-wise Feed-Forward Network: This sub-layer is identical in structure to the one found in the encoder layer. It consists of two linear transformations with a ReLU activation function in between:
$$\text{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2$$

This network is applied independently to each position, processing the output from the encoder-decoder attention sub-layer. It provides additional non-linear transformations to further refine the representation at each position.
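A minimal sketch of this position-wise FFN; the inner dimension `d_ff = 2048` follows the original Transformer paper, while the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand to the inner dimension
        self.linear2 = nn.Linear(d_ff, d_model)   # project back to the model dimension
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dimension,
        # so every position is transformed with the same weights.
        return self.linear2(self.relu(self.linear1(x)))
```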
Each of these three sub-layers has a residual connection around it, followed by layer normalization, just like in the encoder. This helps with gradient flow and stabilizes the training of deep models.
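A sketch of this post-norm "Add & Norm" wrapper, i.e. LayerNorm(x + Sublayer(x)); the dropout rate and class name are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Add the sub-layer output back to its input, then normalize.
        return self.norm(x + self.dropout(sublayer_out))
```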
Flow within a single Transformer Decoder layer. It includes masked self-attention, encoder-decoder attention (using output from the encoder), and a feed-forward network, each followed by Add & Norm.
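To make this flow concrete, here is a compact sketch of one decoder layer that chains the three sub-layers; it reuses PyTorch's `nn.MultiheadAttention` and assumes the `PositionwiseFeedForward` and `AddNorm` sketches above, with all sizes illustrative:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = PositionwiseFeedForward(d_model, d_ff)
        self.addnorm1 = AddNorm(d_model)
        self.addnorm2 = AddNorm(d_model)
        self.addnorm3 = AddNorm(d_model)

    def forward(self, x, encoder_output, causal_mask):
        # 1. Masked self-attention over previously generated positions
        #    (causal_mask: bool, True marks positions that may not be attended to).
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.addnorm1(x, attn_out)
        # 2. Cross-attention: queries from the decoder, keys/values from the encoder.
        cross_out, _ = self.cross_attn(x, encoder_output, encoder_output)
        x = self.addnorm2(x, cross_out)
        # 3. Position-wise feed-forward network.
        x = self.addnorm3(x, self.ffn(x))
        return x
```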
The output of the final decoder layer in the stack represents the processed target sequence information, conditioned on both the input sequence (via the encoder) and the previously generated target tokens. This output tensor is then typically passed to a final linear layer and a softmax function to produce probability distributions over the vocabulary for the next token prediction, which we'll cover next.