Let's move from the idea of attention to how it works mechanically, at least from a high-level perspective. We know that attention lets the model focus on specific parts of the input sequence when generating an output. But how does the model decide which parts matter at a given step? This is where the calculation of attention scores comes in.
Imagine you're translating a sentence. When producing a particular word in the target language, you might need to look back at specific words in the source sentence. Attention formalizes this intuition. It calculates a set of attention scores that quantify the relevance of each input element to the current processing step.
To do this, we introduce three important concepts, represented as vectors derived from our input embeddings (we'll cover exactly how they are derived later). A short code sketch after this list makes their roles concrete:
Query (Q): Think of the Query vector as representing the current focus or question. For example, in a sequence-to-sequence model's decoder, the query might represent the state trying to predict the next word. It's asking, "Given my current state, which parts of the input sequence are most relevant right now?"
Key (K): Each element in the input sequence has an associated Key vector. You can think of the Key as a kind of label or identifier for that input element's content, used for matching against the Query. The Query will be compared against all Keys.
Value (V): Each input element also has a Value vector. This vector represents the actual content or meaning of that input element. Once we know how relevant each input element is (by comparing Q and K), the Value vector provides the information we'll actually use.
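To make these three roles concrete, here is a minimal sketch in NumPy. The toy sentence, the vector dimension, and every number below are invented purely for illustration; in a real model, Q, K, and V come from learned projections of the input embeddings, which we cover later.

```python
import numpy as np

# Toy setup: a 4-token input sequence, each element represented by a
# 3-dimensional vector. All numbers are invented for illustration only.
tokens = ["the", "cat", "sat", "down"]

# One Key vector per input element: a "label" used for matching.
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# One Value vector per input element: the content we may pass along.
V = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2]])

# A single Query vector: the current focus, asking "what is relevant now?"
q = np.array([1.0, 0.5, 0.0])

print(K.shape, V.shape, q.shape)  # (4, 3) (4, 3) (3,)
```

The key point is the bookkeeping: one Key and one Value per input element, and a single Query for the current step.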
The process generally unfolds like this:
Calculate Similarity Scores: The model compares the Query vector (representing the current focus) with the Key vector of every element in the input sequence. This comparison yields a raw similarity score for each input element. A common way to calculate this similarity is the dot product between the Query vector and each Key vector. A higher score indicates a stronger match between what the Query is asking for and what that element's Key describes.
Normalize Scores into Weights: The raw similarity scores are then normalized so they form a distribution of importance. Typically, a softmax function is applied, which converts the scores into positive values that sum to 1. These resulting values are called attention weights. An input element with a higher attention weight is considered more relevant to the current Query. Both steps are shown in a short code sketch after the diagram below.
This diagram illustrates the flow: a Query is compared against multiple Keys to produce similarity scores, which are then normalized (e.g., via Softmax) into attention weights.
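Here is a minimal NumPy sketch of both steps end to end, reusing the toy vectors from earlier. One caveat: Transformer-style attention usually also scales the dot products by the square root of the Key dimension; we omit that refinement here since this section introduces the plain dot product.

```python
import numpy as np

# Same illustrative toy vectors as before.
q = np.array([1.0, 0.5, 0.0])            # Query: the current focus
K = np.array([[1.0, 0.0, 1.0],           # one Key per input element
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Step 1: raw similarity scores -- the dot product of q with every Key.
scores = K @ q                            # shape (4,): one score per element

# Step 2: softmax normalizes the scores into attention weights.
# Subtracting the max before exponentiating is a standard numerical
# stability trick; it does not change the result.
exp_scores = np.exp(scores - scores.max())
weights = exp_scores / exp_scores.sum()

print("raw scores:       ", scores)   # [1.0, 0.5, 1.5, 0.0]
print("attention weights:", weights)  # largest weight on the third element
print("weights sum to:   ", weights.sum())  # 1.0
```

The element whose Key most resembles the Query (here the third) receives the largest weight, which is exactly the behavior these two steps are designed to produce.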
These attention weights are the core output of this stage. They tell the model exactly how much "attention" to pay to each element in the input sequence, given the current Query. In the next section, we'll see how these weights are combined with the Value vectors to produce a single, informative context vector. This mechanism handles dependencies in sequences dynamically and flexibly, overcoming a key limitation of the fixed-length context vector used in simpler encoder-decoder RNN approaches.