Okay, let's dive into the core calculation mechanism for self-attention: Scaled Dot-Product Attention. This is the engine that computes how much each element in a sequence should attend to every other element (including itself). Recall from the previous section that we transform our input embeddings into three distinct vectors: Query (Q), Key (K), and Value (V). Think of it this way:
Query (Q): Represents the current word or position asking for information. "What should I pay attention to?"
Key (K): Represents all the words or positions offering information, acting like labels or identifiers. "Here's what I represent."
Value (V): Represents the actual content or meaning associated with each Key. "Here's the information I hold."
The goal is to use the Query of a specific position to check its compatibility with the Keys of all positions, and then use these compatibility scores to create a weighted sum of the Values. This weighted sum becomes the output representation for the Query position, enriched with context from the entire sequence.
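To make this concrete, here is a minimal NumPy sketch of how Q, K, and V might be produced. The input matrix X, the sizes (seq_len, d_model, d_k, d_v), and the randomly initialized projection matrices W_q, W_k, W_v are illustrative stand-ins for the learned weights a real model would use, not details taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; real models use much larger dimensions.
seq_len, d_model, d_k, d_v = 4, 8, 8, 8

# X stands in for the input embeddings: one row per position in the sequence.
X = rng.normal(size=(seq_len, d_model))

# W_q, W_k, W_v stand in for the learned projection matrices from the previous section.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

Q = X @ W_q  # "What should I pay attention to?"  -> shape (seq_len, d_k)
K = X @ W_k  # "Here's what I represent."         -> shape (seq_len, d_k)
V = X @ W_v  # "Here's the information I hold."   -> shape (seq_len, d_v)
```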
The Scaled Dot-Product Attention Formula
The computation is elegantly captured in a single formula, originally presented in the "Attention Is All You Need" paper:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Let's break this down step by step:
Calculate Similarity Scores (QKᵀ)
The first step is to compute the dot product between the Query vector of the position we're focusing on and the Key vectors of all positions in the sequence. If we represent all Queries as a matrix Q and all Keys as a matrix K, we can compute all of these scores simultaneously with a single matrix multiplication: QKᵀ.
Why dot products? The dot product measures similarity or alignment between two vectors. A larger dot product between a query q_i and a key k_j suggests that position j is more relevant to position i.
Dimensions: If Q has dimensions (seq_len, d_k) and K has dimensions (seq_len, d_k), the transpose Kᵀ has dimensions (d_k, seq_len). The resulting matrix QKᵀ has dimensions (seq_len, seq_len), where the element at (i, j) is the raw attention score from position i to position j.
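Continuing the sketch above, this whole step is a single matrix multiplication; the variable name scores is ours:

```python
# One dot product per (query position, key position) pair, computed in a single matmul.
scores = Q @ K.T     # (seq_len, d_k) @ (d_k, seq_len) -> (seq_len, seq_len)
print(scores.shape)  # (4, 4); scores[i, j] is the raw attention score from position i to j
```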
Scale the Scores (... / √d_k)
The next step is to scale these scores down by dividing by the square root of the dimension of the Key vectors, √d_k.
Why scaling? The dot products can become very large, especially when d_k is large: if the components of a query and a key are roughly independent with mean 0 and variance 1, their dot product has variance d_k. Large inputs push the softmax into regions with extremely small gradients, making training unstable or slow. Dividing by √d_k keeps the variance of the softmax inputs roughly constant regardless of d_k, which helps stabilize training.
d_k: This is the dimensionality of the key (and query) vectors. In many Transformer implementations the query, key, and value vectors share the same dimensionality, but the scaling factor specifically relates to the dimension used in the dot product calculation, d_k.
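In the sketch, scaling is a single line; d_k is simply read off the last axis of Q:

```python
d_k = Q.shape[-1]                      # dimension used in the dot products
scaled_scores = scores / np.sqrt(d_k)  # keeps the softmax inputs in a moderate range
```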
Apply Softmax (softmax(...))
The scaled scores are then passed through a softmax function. The softmax is applied row-wise to the matrix of scaled scores.
Purpose: Softmax converts the scores into positive values that sum up to 1. Each value in the resulting matrix can be interpreted as an attention weight. The weight at position (i,j) indicates the proportion of attention the Query at position i should pay to the Value at position j. Higher weights mean higher importance.
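A row-wise softmax in the sketch, using the standard max-subtraction trick for numerical stability (which does not change the result):

```python
# Subtracting each row's max before exponentiating avoids overflow without changing the softmax output.
shifted = scaled_scores - scaled_scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
print(weights.sum(axis=-1))  # each row sums to 1.0
```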
Figure: Flow diagram illustrating the steps involved in Scaled Dot-Product Attention, from the input Query, Key, and Value matrices to the final output context vectors.
Multiply by Value Matrix (...V)
Finally, the matrix of attention weights (output of the softmax) is multiplied by the Value matrix V.
Result: This step computes a weighted sum of the Value vectors. For each position i, the output vector z_i is calculated as z_i = Σ_j attention_weight_ij · v_j. Essentially, the Value vector v_j contributes to the output z_i in proportion to how much attention position i decided to pay to position j.
The final output matrix has dimensions (seq_len, d_v), where d_v is the dimension of the value vectors. Each row of this matrix is a context-aware representation of the corresponding input element, incorporating information from the entire sequence based on the calculated attention weights.
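Putting the four steps together, here is one possible end-to-end sketch of the formula. The function name and the max-subtraction stability trick are our additions, and it reuses the illustrative Q, K, V from the sketch at the top of this section:

```python
def scaled_dot_product_attention(Q, K, V):
    """Minimal NumPy sketch of Attention(Q, K, V) = softmax(QKᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability only
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                             # (seq_len, d_v) context vectors

output = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # (4, 8): one context-aware vector per input position
```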
Intuition and Summary
Scaled Dot-Product Attention provides an efficient way for the model to weigh the importance of different elements in a sequence relative to each other. By comparing each element's Query with every other element's Key, it generates attention scores. These scores, after scaling and normalization via softmax, dictate how to combine the Value vectors into a refined representation for each element. This mechanism allows the model to dynamically focus on relevant parts of the input sequence when processing any given part, forming the foundation for how Transformers understand context within text.
In the next section, we will explore how this mechanism is extended using multiple "heads" to capture different types of relationships simultaneously.