As we discussed, Recurrent Neural Networks (RNNs), while effective for shorter sequences, struggle with longer ones. The primary difficulty lies in compressing the meaning of an arbitrarily long input sequence into a single, fixed-size hidden state vector. This vector becomes an information bottleneck, making it hard for the model to remember details from earlier in the sequence, especially when generating outputs later on. Imagine trying to summarize a long novel in a single sentence – important nuances are inevitably lost.
How do humans handle complex tasks like translation or summarization? We don't just read the entire source text and then write the translation from memory based on a single internal summary. Instead, we focus selectively. When translating a particular phrase, we might glance back at the corresponding words or phrases in the original text. We pay attention to the parts of the input that are most relevant to the specific output we are generating at that moment.
This is the core idea behind the attention mechanism in neural networks. Instead of forcing the model to rely solely on a compressed representation of the entire input, attention provides a way for the model to look back at the full input sequence at each step of generating the output. It learns to dynamically assign importance scores (attention weights) to different parts of the input based on their relevance to the current task.
Think of the model generating an output sequence, one element (e.g., a word) at a time. For each output element it needs to produce, the attention mechanism allows it to:

1. Look at the representations of every position in the input sequence.
2. Score each input position for how relevant it is to the output element being generated.
3. Combine the input representations, weighted by these scores, into a context vector – a special summary of the input sequence, tailored specifically for the current output step, emphasizing the most relevant input parts.

This process allows the model to selectively focus its "attention" on different parts of the input sequence as needed, overcoming the fixed-size bottleneck of traditional encoder-decoder models based purely on RNNs.
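To make these steps concrete, here is a minimal sketch in NumPy. The setup is hypothetical: random encoder states and a random decoder state, with a plain dot product as the relevance score. Real models learn both the representations and the scoring function, so treat this only as an illustration of the weighting-and-summing pattern.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical toy setup: 5 input positions, 4-dimensional representations.
encoder_states = np.random.randn(5, 4)   # one row per input position
decoder_state = np.random.randn(4)       # state at the current output step

# Step 2: score each input position (here, a simple dot-product score).
scores = encoder_states @ decoder_state       # shape (5,)

# Normalize scores into attention weights that sum to 1.
weights = softmax(scores)                     # shape (5,)

# Step 3: the context vector is the weighted sum of the input representations.
context_vector = weights @ encoder_states     # shape (4,)
```

The context vector changes at every output step because the decoder state, and therefore the scores, change; that is what lets the model emphasize different input parts at different times.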
The decoder uses its current state to generate a 'Query'. This query is compared against 'Keys' derived from all input representations to compute attention scores. These scores weight the corresponding 'Values' (also derived from inputs) to create a context vector, which informs the decoder's next step.
To formalize this process, attention mechanisms typically operate using three types of vectors derived from the sequence representations:

- Query: a vector representing what the model is currently looking for, such as the decoder's current state.
- Key: a vector for each input position, used to measure how well that position matches the query.
- Value: a vector for each input position, carrying the content that is actually aggregated into the context vector.
Essentially, the mechanism performs a lookup. The Query searches across all available Keys. The degree of match between the Query and a Key determines the weight assigned to the corresponding Value. All weighted Values are then aggregated.
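One common way to implement this lookup is scaled dot-product attention. The sketch below assumes the Query, Key, and Value matrices are already given (the learned projections that would normally produce them from the sequence representations are omitted), and the scaling by the square root of the key dimension follows the usual convention.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (num_queries, d_k) - one query per output step
    K: (num_inputs, d_k)  - one key per input position
    V: (num_inputs, d_v)  - one value per input position
    Returns context vectors of shape (num_queries, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # match each query against every key
    weights = softmax(scores)         # attention weights; each row sums to 1
    return weights @ V                # aggregate the weighted values

# Toy example: 3 output steps attending over 5 input positions.
Q = np.random.randn(3, 8)
K = np.random.randn(5, 8)
V = np.random.randn(5, 16)
context = attention(Q, K, V)          # shape (3, 16)
```

Each row of the result is a context vector for one query: the better a key matches that query, the more its value contributes to the aggregate.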
This ability to dynamically weigh input information based on relevance is a significant step forward from relying on a single, static context vector derived from the final RNN state. It allows models to handle dependencies across much longer distances in the input and output sequences.
In the following sections, we will look more closely at how these Query, Key, and Value vectors are typically generated and how the attention scores and context vectors are calculated using these components.