While the masked self-attention mechanism allows the decoder to consider previous tokens within the sequence it's generating, it lacks direct access to the information encoded from the input sequence. To bridge this gap and enable the decoder to selectively focus on relevant parts of the source information, the Transformer architecture incorporates a second attention mechanism within each decoder layer: encoder-decoder cross-attention.
Unlike self-attention, where Queries (Q), Keys (K), and Values (V) all originate from the same sequence (either the input sequence in the encoder or the partially generated output sequence in the decoder's masked self-attention), cross-attention draws its components from different sources:

- **Queries (Q):** derived from the output of the preceding masked self-attention sub-layer in the decoder.
- **Keys (K) and Values (V):** derived from the final output of the encoder stack.
This structure allows the decoder, at each step, to pose a query based on its current context (the generated sequence so far) and match it against the keys representing the entire input sequence. The resulting attention scores determine how much weight to give to the corresponding values from the encoder's output when constructing the decoder's output for that step.
The computation itself uses the same scaled dot-product attention function discussed previously; only the sources of the inputs change:

$$\text{Attention}(Q_{dec}, K_{enc}, V_{enc}) = \text{softmax}\left(\frac{Q_{dec} K_{enc}^T}{\sqrt{d_k}}\right) V_{enc}$$

Here:

- $Q_{dec}$ is the matrix of queries projected from the decoder's current state.
- $K_{enc}$ and $V_{enc}$ are the matrices of keys and values projected from the encoder's output.
- $d_k$ is the dimensionality of the keys, used to scale the dot products.
The softmax function ensures the weights assigned to the encoder's value vectors sum to one, creating a weighted average based on relevance as determined by the query-key interactions.
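To make the computation concrete, here is a minimal PyTorch sketch of scaled dot-product cross-attention. The function name `cross_attention` and the toy tensor shapes are illustrative assumptions, not part of any particular library; the point is only that the queries and the keys/values come from tensors of different lengths.

```python
import torch
import torch.nn.functional as F

def cross_attention(q_dec, k_enc, v_enc):
    """Scaled dot-product attention with decoder queries and encoder keys/values."""
    d_k = q_dec.size(-1)
    # Each decoder position scores every encoder position.
    scores = q_dec @ k_enc.transpose(-2, -1) / d_k ** 0.5  # (tgt_len, src_len)
    weights = F.softmax(scores, dim=-1)                    # each row sums to 1
    return weights @ v_enc, weights                        # (tgt_len, d_v), (tgt_len, src_len)

# Toy example: 3 decoder (target) positions attending over 5 encoder (source) positions.
q_dec = torch.randn(3, 64)   # queries from the decoder's masked self-attention output
k_enc = torch.randn(5, 64)   # keys projected from the encoder's final output
v_enc = torch.randn(5, 64)   # values projected from the encoder's final output
out, attn = cross_attention(q_dec, k_enc, v_enc)
print(out.shape, attn.shape)  # torch.Size([3, 64]) torch.Size([3, 5])
```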
Just like self-attention, cross-attention benefits from multiple attention heads operating in parallel. Each head applies separate linear projections to the incoming Qdec, Kenc, and Venc to map them into different representation subspaces.
$$\text{head}_i = \text{Attention}(Q_{dec} W_i^Q, K_{enc} W_i^K, V_{enc} W_i^V)$$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are the learned projection matrices for head $i$.
The outputs of these independent heads are then concatenated and passed through a final linear projection layer, identical to the process in multi-head self-attention:
$$\text{MultiHead}(Q_{dec}, K_{enc}, V_{enc}) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O$$

Using multiple heads allows the decoder to attend to the encoder output according to several different criteria simultaneously, each in its own representational subspace. For instance, in translation, one head might focus on syntactic alignment while another focuses on semantic correspondence.
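The sketch below packs the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ into single linear layers, a common implementation convenience. The class name `MultiHeadCrossAttention` and the default dimensions are assumptions made for illustration, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadCrossAttention(nn.Module):
    """Minimal multi-head cross-attention sketch (names and sizes are illustrative)."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        # All per-head projection matrices are packed into single linear layers.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final output projection W^O

    def forward(self, dec_state, enc_output):
        # dec_state:  (batch, tgt_len, d_model) -> queries
        # enc_output: (batch, src_len, d_model) -> keys and values
        B, T, _ = dec_state.shape
        S = enc_output.size(1)

        def split_heads(x, proj, length):
            return proj(x).view(B, length, self.h, self.d_head).transpose(1, 2)

        q = split_heads(dec_state, self.w_q, T)    # (B, h, T, d_head)
        k = split_heads(enc_output, self.w_k, S)   # (B, h, S, d_head)
        v = split_heads(enc_output, self.w_v, S)   # (B, h, S, d_head)

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (B, h, T, S)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v                                    # (B, h, T, d_head)
        concat = heads.transpose(1, 2).reshape(B, T, self.h * self.d_head)
        return self.w_o(concat)                                # Concat(head_1..h) W^O
```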
The primary purpose of encoder-decoder cross-attention is to condition the decoder's generation process on the relevant parts of the input sequence. Without it, the decoder would only have access to the input information through the initial decoder state, lacking the ability to dynamically focus on specific source tokens as generation progresses.
Consider translating "The black cat" (input) to "Le chat noir" (output). When the decoder generates "chat", its cross-attention weights should concentrate on the encoder's representation of "cat"; when it generates "noir", they should shift toward "black". Because the decoder re-queries the encoder output at every step, its focus moves across the source sequence as generation proceeds.
This dynamic focusing is fundamental to the Transformer's success in sequence-to-sequence tasks.
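As a rough illustration of how this focusing can be inspected, the snippet below runs PyTorch's `nn.MultiheadAttention` as a cross-attention layer and prints its (target × source) weight matrix. The tensors are random stand-ins and the module is untrained, so the values are arbitrary; in a trained translation model, the row for "chat" would place most of its mass on the column for "cat".

```python
import torch
import torch.nn as nn

d_model = 16
# Stand-ins for the encoder output of "The black cat" (3 source tokens) and the
# decoder state while generating "Le chat noir" (3 target tokens so far).
enc_output = torch.randn(1, 3, d_model)   # (batch, src_len, d_model)
dec_state = torch.randn(1, 3, d_model)    # (batch, tgt_len, d_model)

cross_attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
_, weights = cross_attn(dec_state, enc_output, enc_output, need_weights=True)

# weights[0] is a (tgt_len, src_len) matrix of attention weights (averaged over heads).
# Here the module is untrained, so the pattern is meaningless; after training it
# reveals a soft alignment between target and source tokens.
print(weights[0])
```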
The cross-attention mechanism sits between the masked self-attention sub-layer and the position-wise feed-forward sub-layer within each decoder layer. Residual connections and layer normalization are applied around it, just as with the other sub-layers.
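A simplified sketch of that ordering (using the original post-norm arrangement) is shown below, with PyTorch's `nn.MultiheadAttention` standing in for both attention sub-layers. The class name `DecoderLayerSketch` and the default dimensions are illustrative assumptions.

```python
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Illustrative sub-layer ordering of one Transformer decoder layer (post-norm)."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_output, causal_mask):
        # 1. Masked self-attention over the sequence generated so far.
        sa, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + sa)                       # residual connection + layer norm
        # 2. Cross-attention: queries from the decoder state, keys/values from the encoder.
        ca, _ = self.cross_attn(x, enc_output, enc_output)
        x = self.norm2(x + ca)
        # 3. Position-wise feed-forward network.
        return self.norm3(x + self.ffn(x))
```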
Diagram: simplified data flow for the encoder-decoder cross-attention sub-layer within a Transformer decoder layer. Queries originate from the decoder's state, while Keys and Values come from the encoder's final output.
Understanding this cross-attention mechanism is important for grasping how the Transformer decoder effectively utilizes the encoded representation of the input sequence to guide the generation of the output sequence. It acts as the primary bridge connecting the encoder's processing of the source sequence to the decoder's step-by-step generation of the target sequence.