In Chapter 1, we examined the sequential nature of Recurrent Neural Networks (RNNs) and their variants, such as LSTMs and GRUs. While these models are effective for many sequence tasks, a significant limitation arises from their core mechanism: they compress the information from an entire input sequence, regardless of its length, into a single fixed-size hidden state vector (or context vector in encoder-decoder setups).
Consider a standard sequence-to-sequence (seq2seq) model, often used for tasks like machine translation. The encoder processes the input sequence step-by-step, updating its hidden state at each step. The final hidden state of the encoder, often called the context vector, is intended to summarize the entire input sequence. This single vector is then passed to the decoder, which uses it as its initial state to generate the output sequence.
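To make this concrete, the sketch below shows a minimal encoder-decoder in PyTorch. It is only an illustration of the bottleneck described above, not a production model; the class and dimension names are chosen for clarity, and GRU units stand in for whatever recurrent cell an actual system might use. The key point is on the encoder line: everything the decoder ever sees about the input flows through the single final hidden state.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder's final hidden state is the
    single fixed-size context vector that initializes the decoder."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(vocab_size, emb_dim)
        self.tgt_embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sequence; keep only the final hidden state.
        _, context = self.encoder(self.src_embed(src_ids))  # (1, batch, hidden)
        # The decoder receives the input sequence only through this one vector,
        # used here as its initial hidden state.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), context)
        return self.out(dec_out)                             # (batch, tgt_len, vocab)
```

However long the source sequence is, `context` stays the same size, which is exactly the compression problem discussed next.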
Figure: a simplified view of a traditional encoder-decoder architecture. The encoder outputs a single fixed-size context vector C representing the entire input, which becomes the sole source of input information for the decoder.
This fixed-size context vector represents an inherent information bottleneck. Imagine trying to summarize a lengthy paragraph or document into a single, short sentence; inevitably, details are lost. Similarly, forcing a model to encode all nuances of a long input sequence (e.g., a complex sentence in translation) into one vector becomes increasingly challenging as sequence length grows. The decoder's performance is fundamentally constrained by the quality and completeness of this single summary vector. Information from earlier parts of the input sequence might be "overwritten" or diluted by later inputs during the sequential processing of the encoder.
This bottleneck makes it difficult for the decoder to access specific, relevant pieces of information from the input when generating different parts of the output. For instance, when translating a sentence, the choice of a specific output word might depend heavily on a particular word or phrase near the beginning of the input sentence. Relying solely on the final compressed context vector makes accessing such specific, distant information unreliable.
To overcome this limitation, we need a mechanism that allows the model, particularly the decoder, to "look back" at the entire sequence of encoder hidden states (or representations of the input) at each step of the output generation process. Instead of relying on a single static summary, the model should be able to dynamically assign varying degrees of importance to different parts of the input sequence based on what it's trying to predict at that moment.
This is the core motivation behind the attention mechanism. It provides a way to create a dynamic, context-dependent summary of the input sequence tailored specifically for each output step. Rather than compressing everything into one fixed vector, attention allows the model to selectively focus on the most relevant input elements, effectively bypassing the fixed-context bottleneck.
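As a preview of that idea, here is a minimal sketch (assuming PyTorch; the function name and tensor shapes are illustrative) of the dynamic weighting attention performs. A similarity score between the current decoder state and every encoder hidden state is normalized with a softmax, and the resulting weights build a fresh context vector for that output step. The next sections formalize this with the Query/Key/Value abstraction and the scaling used in Scaled Dot-Product Attention.

```python
import torch
import torch.nn.functional as F

def dot_product_context(decoder_state, encoder_states):
    """Build a context vector for one decoder step by weighting all encoder
    hidden states, instead of reusing a single fixed summary.

    decoder_state:  (batch, hidden)          current decoder hidden state
    encoder_states: (batch, src_len, hidden) all encoder hidden states
    """
    # Similarity between the decoder state and every encoder state.
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (batch, src_len, 1)
    # Normalize into attention weights that sum to 1 over source positions.
    weights = F.softmax(scores, dim=1)                              # (batch, src_len, 1)
    # Weighted sum of encoder states: a context vector tailored to this step.
    context = (weights * encoder_states).sum(dim=1)                 # (batch, hidden)
    return context, weights.squeeze(2)
```

Because the weights are recomputed at every decoding step, the model can emphasize a word near the start of the input for one output token and a word near the end for the next.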
The following sections will detail how this attention mechanism is implemented, starting with the fundamental abstraction of Queries, Keys, and Values, and leading to the widely adopted Scaled Dot-Product Attention formulation used within the Transformer architecture.