Within the decoder stack, the self-attention mechanism operates differently than in the encoder. Recall that the encoder processes the entire input sequence simultaneously, allowing each position to attend to all other positions (including future ones relative to itself). This bidirectional context is beneficial for understanding the input sequence structure.
However, the decoder's primary role, especially in tasks like machine translation or text generation, is often autoregressive. This means it generates the output sequence one token at a time, from left to right. When predicting the token at position i, the decoder should only have access to the previously generated tokens (positions 1 to i−1) and the complete encoded input sequence. It must not be allowed to "look ahead" at tokens in positions i,i+1,… in the target sequence it is currently generating. Allowing such access would make the generation task trivial during training, as the model could simply copy the next token instead of learning to predict it.
To enforce this unidirectional information flow, the decoder employs masked self-attention. The core idea is to modify the standard scaled dot-product attention calculation by masking out (setting to negative infinity) any attention scores that correspond to connections to future positions.
The masking occurs just before the softmax operation is applied to the scaled attention scores. Let the standard scaled dot-product attention scores be computed as:
$$\text{scores} = \frac{QK^T}{\sqrt{d_k}}$$

where $Q$, $K$, and $V$ are the query, key, and value matrices derived from the decoder's input (or the output of the previous decoder layer), and $d_k$ is the dimension of the keys.
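As a quick illustration, the score computation could be sketched in NumPy as follows; the matrix shapes and variable names here are illustrative choices, not part of the original formulation:

```python
import numpy as np

seq_len, d_k = 5, 64  # toy sizes for illustration

rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))  # queries from the decoder input
K = rng.normal(size=(seq_len, d_k))  # keys from the decoder input

# Scaled dot-product scores: shape (seq_len, seq_len)
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)  # (5, 5)
```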
A mask matrix, $M$, is created. This matrix has dimensions compatible with the attention scores (sequence length × sequence length). For a position $i$ attending to position $j$:

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
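One common way to construct such a mask is with an upper-triangular matrix; here is a NumPy sketch, assuming a sequence length of 5:

```python
import numpy as np

seq_len = 5

# Boolean matrix marking future positions (j > i): True strictly above the diagonal
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# 0 where attention is allowed, -inf where it must be blocked
M = np.where(future, -np.inf, 0.0)
print(M)
```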
This mask matrix M is then added to the attention scores:
$$\text{masked\_scores} = \text{scores} + M$$

Finally, the softmax function is applied to these masked scores:

$$\text{AttentionWeights} = \text{softmax}(\text{masked\_scores})$$

The effect of adding $-\infty$ is that, after the exponentiation inside the softmax, these scores become $e^{-\infty} = 0$. Consequently, the attention weights for future positions become zero, effectively preventing any information flow from those positions. The decoder position $i$ can only attend to positions $1$ through $i$.
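Putting these steps together, here is a minimal, self-contained NumPy sketch of masked self-attention; the function name and toy dimensions are assumptions made for illustration:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)

    # Mask: 0 on and below the diagonal, -inf above it (future positions)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked_scores = scores + np.where(future, -np.inf, 0.0)

    # Row-wise softmax; exp(-inf) = 0, so future positions receive zero weight
    shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
output, weights = masked_self_attention(Q, K, V)
print(np.round(weights, 2))  # entries above the diagonal are all 0.0
```

Each row of `weights` sums to 1, but only over positions up to and including the query's own position, matching the behavior described above.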
A visual representation of the attention mask for a sequence of length 5. Blue cells (value 1) indicate positions the query (row) is allowed to attend to (key column). Gray cells (value 0) indicate masked positions (future tokens). Note that each position can attend to itself and all preceding positions.
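The 0/1 pattern described in this figure corresponds to a lower-triangular matrix of ones, which can be generated directly (a small sketch, assuming sequence length 5):

```python
import numpy as np

# 1 = attention allowed (self and preceding positions), 0 = masked (future)
allowed = np.tril(np.ones((5, 5), dtype=int))
print(allowed)
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```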
This causal attention mechanism is fundamental for enabling the Transformer decoder to learn sequence generation tasks effectively. It ensures that the prediction for each step only depends on the known outputs from previous steps, mirroring the conditions during actual inference or generation. This contrasts sharply with the encoder's self-attention, which can freely incorporate information from the entire input sequence. The combination of masked self-attention (for processing the generated sequence so far) and cross-attention (for incorporating information from the encoder) allows the decoder to produce coherent and contextually relevant output sequences.