In self-attention, the objective is for each word in an input sequence to determine the relevance of every other word (including itself) within that specific context. To achieve this comparison and subsequent information weighting, we transform each input word's embedding into three distinct representations: the Query, Key, and Value vectors.
Consider an input sequence like "thinking machine". Each word is initially represented by an embedding vector. Let's denote the embedding for "thinking" as x1 and for "machine" as x2. These vectors hold the initial, context-independent meaning of the words.
For the self-attention calculation, we don't directly compare these embeddings. Instead, we project each embedding xi into three separate vector spaces using three unique weight matrices learned during training. These are commonly denoted as WQ, WK, and WV.
Specifically, for every input embedding xᵢ in the sequence, we compute a query vector qᵢ = xᵢ · W^Q, a key vector kᵢ = xᵢ · W^K, and a value vector vᵢ = xᵢ · W^V.
This process generates a distinct query, key, and value vector for each word in the input sequence. These derived vectors often have a smaller dimension (dk for queries and keys, dv for values) than the original embedding dimension (dmodel). The one hard requirement is that query and key vectors share the same dimension dk, so that they can be compared via dot products.
We can visualize this transformation for a single input token:
Derivation of Query (q), Key (k), and Value (v) vectors from a single input embedding (xᵢ) using learned weight matrices (W^Q, W^K, W^V).
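As a concrete illustration, here is a minimal NumPy sketch of these projections for the two-token sequence "thinking machine". The embedding values, the weight matrices, and the small dimensions (dmodel = 4, dk = dv = 3) are arbitrary stand-ins chosen for readability; in a trained model the weights would come from training, not a random generator.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k, d_v = 4, 3, 3          # illustrative sizes, not a prescribed choice

# Toy embeddings for the two-token sequence "thinking machine",
# stacked as the rows of X (shape: 2 x d_model).
X = rng.normal(size=(2, d_model))

# Random stand-ins for the learned projection matrices W^Q, W^K, W^V.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

# One matrix multiply per projection gives q_i, k_i, v_i for every token i.
Q = X @ W_Q   # shape: 2 x d_k
K = X @ W_K   # shape: 2 x d_k
V = X @ W_V   # shape: 2 x d_v

print(Q.shape, K.shape, V.shape)     # (2, 3) (2, 3) (2, 3)
```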
To understand the roles of these vectors, consider the analogy of retrieving information from a database: the Query is the search term a word issues, each Key acts like the label or index attached to a stored entry, and each Value is the content held under that label. A query is matched against every key, and the values of the best-matching entries are what get retrieved.
Essentially, the dot product between a Query vector qi and a Key vector kj determines the strength of connection, or attention weight, from word i to word j. The Value vector vj provides the information that gets passed from word j back to word i, scaled by this attention weight.
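The sketch below previews this interaction for a single word i, using randomly generated vectors purely for illustration: its query is compared against every key with a dot product, a softmax turns the resulting scores into weights, and the output is the weighted sum of the value vectors. (The scaling by the square root of dk, covered in the next section, is omitted here.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 3, 3

# Toy query for word i, plus keys and values for a 2-word sequence.
q_i = rng.normal(size=d_k)
K = rng.normal(size=(2, d_k))        # one key vector per word j
V = rng.normal(size=(2, d_v))        # one value vector per word j

# q_i . k_j measures how strongly word i attends to word j ...
scores = K @ q_i

# ... and a softmax turns those scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# Word i's output is the attention-weighted sum of all value vectors v_j.
output_i = weights @ V               # shape: (d_v,)
```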
An important aspect is that the weight matrices WQ, WK, and WV are parameters learned during the model training process. Initially random, they are adjusted through backpropagation so the model learns the most effective way to project input embeddings into these Q, K, and V spaces for the task at hand (e.g., machine translation, text summarization). This learning process allows the model to understand complex relationships and dependencies within the input sequence.
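In practice, frameworks typically expose WQ, WK, and WV as learnable linear layers so that backpropagation can update them during training. The sketch below uses PyTorch; the class name and the dimensions (dmodel = 512, dk = dv = 64, as in the original Transformer paper) are illustrative choices, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class QKVProjection(nn.Module):
    """Holds the learnable matrices W^Q, W^K, W^V as linear layers."""

    def __init__(self, d_model: int = 512, d_k: int = 64, d_v: int = 64):
        super().__init__()
        # Each layer's weight starts out random and is adjusted by backpropagation.
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_v, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (sequence_length, d_model) -> three projected views of the sequence
        return self.w_q(x), self.w_k(x), self.w_v(x)

proj = QKVProjection()
x = torch.randn(2, 512)              # toy embeddings for "thinking machine"
q, k, v = proj(x)                    # each row is one token's q / k / v vector
```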
Having generated these Query, Key, and Value vectors for every word, we are now equipped to calculate the actual attention scores. The next section details how the Scaled Dot-Product Attention mechanism utilizes these vectors to compute the precise attention weights that define how information flows between words in the sequence.