Okay, we've seen how attention mechanisms calculate scores, essentially assigning a relevance weight (α) to each input element based on a query. These scores tell us how much attention to pay to different parts of the input sequence. But what do we do with these scores? We use them to create a context vector.
Think of the context vector as a distilled summary of the input sequence, tailored specifically to the current focus or query. Instead of just compressing the entire input into a single, fixed-size vector like traditional RNNs often do, attention allows us to create a dynamic summary that emphasizes the most relevant parts.
The core idea is remarkably straightforward: the context vector is calculated as a weighted sum of the Value vectors (v) from the input sequence. The weights used in this sum are precisely the attention scores (α) we calculated previously.
Mathematically, if we have $n$ input elements, each with a corresponding Value vector $v_i$ and an attention score $\alpha_i$ (relative to the current query), the context vector $C$ is computed as:
$$C = \sum_{i=1}^{n} \alpha_i v_i$$

Let's break this down:
Imagine you have three input words, each represented by a Value vector ($v_1, v_2, v_3$), and you've calculated attention weights ($\alpha_1, \alpha_2, \alpha_3$) for them based on a specific query. The context vector $C$ is formed by blending these vectors according to the weights.
The diagram shows how each Value vector ($v_i$) is multiplied by its corresponding attention weight ($\alpha_i$). These weighted vectors are then summed together to produce the final context vector ($C$).
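To make this concrete, here is a minimal NumPy sketch of that weighted sum. All numbers are made up for illustration; in a real model the Value vectors come from learned projections of the input and the weights come from a softmax over the attention scores.

```python
import numpy as np

# Hypothetical 4-dimensional Value vectors for three input words.
v = np.array([
    [1.0, 0.0, 0.5, 0.2],   # v_1
    [0.3, 0.8, 0.1, 0.9],   # v_2
    [0.6, 0.4, 0.7, 0.1],   # v_3
])

# Attention weights for the current query; they sum to 1 (e.g. output of a softmax).
alpha = np.array([0.7, 0.2, 0.1])

# Context vector: weighted sum of the Value vectors, C = sum_i alpha_i * v_i.
C = np.sum(alpha[:, None] * v, axis=0)   # equivalently: alpha @ v
print(C)  # a single 4-dimensional vector blending v_1, v_2, v_3
```

Because the weights sum to 1, $C$ is a convex combination of the Value vectors; here it lands closest to $v_1$, which received the largest weight.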
This mechanism is powerful because it allows the model to dynamically create a relevant summary of the input for each step of the output generation (or for each element being processed).
Consider machine translation. When translating a sentence, the meaning of a specific output word often depends heavily on one or two words in the input sentence, plus some broader context. A traditional RNN might struggle to retain precise information from distant input words due to its sequential nature and fixed-size hidden state.
Attention, however, calculates weights for all input words at each output step. The resulting context vector directly incorporates information from the most relevant input words, regardless of their position in the sequence. This overcomes the information bottleneck of fixed-size context vectors and allows models to handle long-range dependencies much more effectively.
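The sketch below illustrates this per-step reweighting, assuming a scaled dot-product scoring function (other scoring functions work the same way once the weights are produced). The helper name `attention_context` and the toy shapes are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_context(Q, K, V):
    """Scaled dot-product attention: one context vector per query (output step)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of every input position to every query
    alpha = softmax(scores, axis=-1)    # attention weights; each row sums to 1
    return alpha @ V, alpha             # context vectors: weighted sums of the rows of V

# Toy example: 5 input positions, 2 output steps, dimension 4 (values made up).
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 4))   # Key vectors for the input sequence
V = rng.normal(size=(5, 4))   # Value vectors for the input sequence
Q = rng.normal(size=(2, 4))   # one Query per output step

C, alpha = attention_context(Q, K, V)
print(C.shape)       # (2, 4): one context vector per output step
print(alpha.sum(1))  # each row of weights sums to 1
```

Each row of `alpha` is a fresh distribution over all five input positions, so every output step gets its own context vector built from whichever inputs it weights most heavily, regardless of where they sit in the sequence.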
This context vector, enriched with focused information from the input, becomes a primary input for the next stage of processing, such as generating the next word in a sequence or feeding into subsequent layers in the network. We'll see exactly how this fits into the larger Transformer architecture in the upcoming chapters.