As discussed, traditional sequence-to-sequence models often compress the entire input sequence into a single fixed-size context vector. This approach can struggle to retain information from longer sequences, creating an information bottleneck. The attention mechanism provides a more flexible alternative, allowing the model to look back at the entire input sequence and selectively focus on the parts most relevant for generating the current output.
To formalize this selective focus, the attention mechanism employs an abstraction based on three components: Queries, Keys, and Values. Imagine you're researching a topic in a digital library: your question is the Query, the index terms describing each document are the Keys, and the documents themselves are the Values. You compare your question against the index to decide which documents are relevant, then draw most heavily on the best matches.
The attention mechanism works similarly: for a given Query, it compares the Query against all available Keys to determine how well they match. This matching process produces a set of scores which, once normalized, are called attention weights. These weights indicate the relevance of each Key (and its corresponding Value) to the Query. Finally, the mechanism computes an output by aggregating the Values, weighting each Value by its attention weight. Values whose Keys strongly match the Query contribute more to the final output.
In the context of neural networks, Queries, Keys, and Values are represented as vectors derived from the model's internal representations (embeddings or hidden states):
Flow of attention using the Query, Key, and Value abstraction. The Query is compared against all Keys to compute weights, which are then used to form a weighted sum of the corresponding Values.
The core idea is that the Query interacts with the Keys to understand where to focus, and the resulting attention scores determine how much of each Value contributes to the final output representation. This output is a context-aware vector that summarizes the input sequence elements relevant to the specific Query.
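The sketch below traces this flow for a single Query using NumPy. All dimensions and vectors are illustrative assumptions (random toy data, not tied to any particular model): the Query is compared against every Key, the scores are normalized into weights, and the output is the weighted sum of the Values.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: one Query attending over a sequence of 4 input positions.
d_k, d_v = 8, 8                      # illustrative dimensions
rng = np.random.default_rng(0)

q = rng.normal(size=(d_k,))          # the Query vector
K = rng.normal(size=(4, d_k))        # one Key per input position
V = rng.normal(size=(4, d_v))        # one Value per input position

scores = K @ q                       # compatibility of the Query with each Key
weights = softmax(scores)            # attention weights: non-negative, sum to 1
output = weights @ V                 # weighted sum of the Values

print(weights.shape)                 # (4,)  one weight per input position
print(output.shape)                  # (8,)  a context vector of dimension d_v
```

Positions whose Keys score highly against the Query receive larger weights and therefore dominate the resulting context vector.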
These Query, Key, and Value vectors typically have dimensions denoted as d_q, d_k, and d_v respectively. The compatibility between a Query and a Key is often computed using their vector representations, commonly via dot products, which requires d_q = d_k. The dimension d_k plays a significant role in how attention scores are calculated and scaled, as we will see shortly. The dimension of the Value vectors, d_v, determines the dimension of the output context vector before any final transformations.
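As a rough illustration of these shape constraints (again with NumPy and arbitrary toy dimensions chosen for this sketch), dot-product scores require the Query and Key dimensions to match, while the output inherits the Value dimension d_v:

```python
import numpy as np

n_queries, n_keys = 2, 5
d_k, d_v = 16, 32                        # note: d_v need not equal d_k
rng = np.random.default_rng(1)

Q = rng.normal(size=(n_queries, d_k))    # Queries: (n_queries, d_k)
K = rng.normal(size=(n_keys, d_k))       # Keys:    (n_keys, d_k)  -- d_q must equal d_k
V = rng.normal(size=(n_keys, d_v))       # Values:  (n_keys, d_v)

scores = Q @ K.T                         # (2, 5): one score per Query-Key pair
# Scaling by sqrt(d_k) is deferred to the next section; here we only normalize.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
context = weights @ V                    # (2, 32): output dimension is d_v

print(scores.shape, context.shape)
```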
It's important to note that the source of these Q, K, and V vectors defines the type of attention. In self-attention, which is fundamental to the Transformer architecture, Q, K, and V are all derived from the same sequence, allowing different positions within a single sequence to attend to each other. In cross-attention, commonly found in encoder-decoder architectures, the Queries originate from the decoder while the Keys and Values come from the encoder's output, enabling the decoder to focus on relevant parts of the input sequence. For now, we focus on the general QKV abstraction itself.
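To make the distinction concrete, here is a small NumPy sketch in which hypothetical, randomly initialized projection matrices stand in for learned parameters; it only shows where Q, K, and V come from in each case, not a full attention computation.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_k, d_v = 12, 8, 8

# Hypothetical projection matrices; in a real model these are learned parameters.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

encoder_out   = rng.normal(size=(6, d_model))  # e.g. encoder hidden states, 6 positions
decoder_state = rng.normal(size=(3, d_model))  # e.g. decoder hidden states, 3 positions

# Self-attention: Q, K, and V are all projections of the same sequence.
Q_self = encoder_out @ W_q
K_self = encoder_out @ W_k
V_self = encoder_out @ W_v

# Cross-attention: Queries from the decoder, Keys and Values from the encoder output.
Q_cross = decoder_state @ W_q
K_cross = encoder_out @ W_k
V_cross = encoder_out @ W_v

print(Q_self.shape, K_self.shape)    # (6, 8) (6, 8): the sequence attends to itself
print(Q_cross.shape, K_cross.shape)  # (3, 8) (6, 8): 3 decoder queries over 6 encoder positions
```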
This Query-Key-Value framework provides a powerful and flexible way to model dependencies regardless of their distance in the input sequence, directly addressing the limitations of fixed context vectors. The next step is to examine the specific mathematical operations used to implement this comparison and weighting process, starting with the Scaled Dot-Product Attention mechanism.