In the previous chapter, we explored the concept of attention as a way for models, particularly sequence-to-sequence models, to focus on relevant parts of the input when generating an output. This helped overcome some limitations of traditional Recurrent Neural Networks (RNNs), such as difficulty capturing long-range dependencies and their inherently sequential nature, which hinders parallelization.
Now, we shift our focus inward. Imagine processing a sentence like:
"The animal didn't cross the street because it was too tired."
To understand what "it" refers to, we intuitively look back at the preceding words. Is "it" the "animal" or the "street"? Clearly, "it" refers to "animal". How can a model learn this association? RNNs try to capture this context by passing information sequentially through their hidden states, but this becomes challenging over longer distances.
This is where self-attention comes into play. It's a mechanism that allows a model to weigh the importance of different words within the same input sequence when generating a representation for a specific word. Instead of just looking at the immediately preceding hidden state like an RNN, self-attention allows the model to look at the entire sequence simultaneously and ask, for each word, "Which other words in this sequence are most relevant to understanding this word?"
Think of it as allowing each word in the input sequence to interact directly with every other word, including itself. The model calculates an "attention score" between each pair of words. A high score between word A and word B suggests that word B is highly relevant for understanding or representing word A.
Consider the sentence "The cat sat on the mat". When processing the word "sat", self-attention might determine that "cat" and "mat" are particularly relevant, assigning them higher attention scores compared to "the" or "on".
Conceptual illustration of self-attention focusing on the word "sat". Arrows indicate attention flow, with potential scores highlighting relevance (e.g., "cat" and "mat" are highly relevant to "sat").
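To make this concrete, here is a minimal sketch in NumPy that scores every word in "The cat sat on the mat" against "sat". The 4-dimensional embeddings are made up purely for illustration, and the relevance score is a plain dot product; a real model learns its embeddings and uses the Query and Key projections introduced below.

```python
import numpy as np

# Toy, made-up 4-dimensional embeddings for each token in the sentence.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
embeddings = np.array([
    [0.1, 0.0, 0.2, 0.1],   # The
    [0.9, 0.4, 0.1, 0.3],   # cat
    [0.3, 0.8, 0.6, 0.2],   # sat
    [0.1, 0.1, 0.0, 0.2],   # on
    [0.1, 0.0, 0.2, 0.1],   # the
    [0.7, 0.5, 0.2, 0.4],   # mat
])

def softmax(x):
    """Turn raw scores into weights that sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Raw relevance of every token to "sat": a simple dot product between
# the embedding of "sat" and every embedding in the sequence.
query_index = tokens.index("sat")
raw_scores = embeddings @ embeddings[query_index]

# Normalize the scores into attention weights.
weights = softmax(raw_scores)
for tok, w in zip(tokens, weights):
    print(f"{tok:>4}: {w:.2f}")
```

Tokens whose embeddings point in a similar direction to "sat" end up with larger weights, which is the intuition behind the attention scores in the illustration above.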
This is different from the attention we saw in basic sequence-to-sequence models, where a decoder typically attended to the encoder's output. Self-attention happens within a layer, looking at representations from the same layer (or the layer below it).
To achieve this, each input word representation (initially its embedding) conceptually takes on three roles, which we'll explore mathematically in the next section:

- Query: what the word being processed is looking for in the other words.
- Key: what each word offers to be matched against a Query.
- Value: the information a word contributes to the new representation once it is judged relevant.
The model learns how to transform the input embeddings into these Query, Key, and Value representations. The Query of the word being processed is compared against the Keys of all words in the sequence. The results of these comparisons (the attention scores) are then used to create a weighted sum of the Value representations. This weighted sum becomes part of the new, context-aware representation for the original word.
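The sketch below traces this recipe for a single word. The projection matrices are randomly initialized stand-ins for the learned ones, the dimensions are arbitrary, and the scaling factor used in Transformers is deliberately left out until the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 6, 8                     # illustrative sentence length and embedding size
X = rng.normal(size=(seq_len, d_model))     # stand-in for the input embeddings

# Learned projection matrices in a real model; random placeholders here.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Process one word (say, position 2): compare its Query against all Keys,
# then build a weighted sum of all Values.
i = 2
q_i = X[i] @ W_q                  # Query for the word being processed
K = X @ W_k                       # Keys for every word in the sequence
V = X @ W_v                       # Values for every word in the sequence

scores = K @ q_i                  # one relevance score per word
weights = softmax(scores)         # attention weights, summing to 1
new_representation = weights @ V  # context-aware vector for word i

print(weights.round(2))
print(new_representation.shape)   # (8,) -- same size as a Value vector
```

The output vector blends information from every position in proportion to its attention weight, which is exactly the "weighted sum of Values" described above.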
The powerful aspect of self-attention is its ability to directly model relationships between any two words in the sequence, regardless of their distance. This handles long-range dependencies far more effectively than sequential RNN processing. Furthermore, because the attention computation for every word can be expressed as a few matrix multiplications over the whole sequence, all positions can be processed in parallel, leading to significant training speedups compared to RNNs.
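As a sketch of why this parallelizes so well, the per-word computation above collapses into a handful of matrix multiplications that produce new representations for every word at once (again with random stand-in weights and without the scaling factor covered next):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))     # stand-in input embeddings
W_q = rng.normal(size=(d_model, d_model))   # placeholder projection matrices
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v         # Queries, Keys, Values for all words at once

scores = Q @ K.T                            # (seq_len, seq_len): every word scored against every word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

outputs = weights @ V                       # new representation for every word, in one shot
print(outputs.shape)                        # (6, 8)
```

Nothing in these matrix products depends on processing positions one after another, which is what lets hardware compute them in parallel.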
We've now grasped the core idea: self-attention lets words within a sequence directly assess the relevance of other words to refine their own representation. Next, we'll look at the specific mechanism used in Transformers to calculate these attention scores: Scaled Dot-Product Attention.