In the previous chapter, we established the general attention mechanism as a powerful technique for mapping a query and a set of key-value pairs to an output. Recall the core idea: the output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function (like scaled dot-product) between the query and the corresponding key.
Now, we focus on a specialized application of this mechanism known as self-attention. The defining characteristic of self-attention is that the queries, keys, and values all originate from the same source sequence. Instead of relating two different sequences (like in traditional encoder-decoder attention), self-attention allows a sequence to relate different positions within itself. This process enables the model to weigh the importance of other words in the sequence when encoding a specific word, capturing intra-sequence dependencies directly.
Consider an input sequence represented by a matrix $X$, where each row corresponds to a token's embedding (potentially combined with positional information, as we'll discuss in Chapter 4). To compute self-attention, we first project this input $X$ into three distinct representations: Queries ($Q$), Keys ($K$), and Values ($V$). This is typically achieved using learned linear transformations (weight matrices $W^Q$, $W^K$, and $W^V$):
$$
Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V
$$

Here, $W^Q$, $W^K$, and $W^V$ are parameter matrices learned during training. The dimensions of these matrices allow the model to project the input embeddings into spaces suitable for calculating attention scores and weighted values.
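To make the projection step concrete, here is a minimal NumPy sketch. The sequence length, embedding width, and projection width are hypothetical, and random matrices stand in for $W^Q$, $W^K$, and $W^V$, which in a real model are learned parameters.

```python
import numpy as np

# Hypothetical dimensions: 5 tokens, embedding width d_model = 8, projection width d_k = 4.
seq_len, d_model, d_k = 5, 8, 4

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))   # token embeddings (positional info folded in)

# W_Q, W_K, W_V are learned during training; random matrices stand in for them here.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries, shape (seq_len, d_k)
K = X @ W_K   # keys,    shape (seq_len, d_k)
V = X @ W_V   # values,  shape (seq_len, d_k)
```

Note that all three projections take the same $X$ as input; only the weight matrices differ.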
Once we have $Q$, $K$, and $V$, the attention output is computed using the scaled dot-product attention mechanism introduced previously:
$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimension of the key vectors. The crucial point remains: $Q$, $K$, and $V$ are all derived from the same input $X$. The output of this operation is a new sequence representation in which each position's vector is a weighted combination of value vectors from all positions in the original sequence, based on the query-key similarities.
Figure: Derivation of Queries, Keys, and Values from the same input sequence $X$ using distinct linear projections, followed by the scaled dot-product attention calculation.
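Continuing the sketch above, a small NumPy implementation of this computation might look like the following. The softmax is applied row-wise, so each position's output is a weighted combination of the value vectors of all positions.

```python
def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, with the softmax applied per query position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) compatibility scores
    # Numerically stable row-wise softmax: each query position gets a
    # probability distribution over all key positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                     # output (seq_len, d_k) and the weights

output, attn_weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn_weights.shape)             # (5, 4) (5, 5)
```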
For instance, in processing the sentence "The animal didn't cross the street because it was too tired", self-attention allows the model to learn that "it" refers to "animal" rather than "street". The query associated with "it" computes compatibility scores against the keys of every token, including "animal" and "street"; the learned projections $W^Q$ and $W^K$, combined with the softmax, assign a higher weight to the value associated with "animal".
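As a rough illustration of how these weights can be inspected, the sketch below reuses the random projections and function from above on this sentence's tokens. With untrained weights the printed distribution is arbitrary; the point is only that the row of the attention-weight matrix for "it" is exactly the distribution over positions described above.

```python
# Illustrative only: random embeddings and untrained weights, so the distribution
# below is arbitrary. A trained model would concentrate the weight for "it" on "animal".
tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]

X_sent = rng.normal(size=(len(tokens), d_model))            # stand-in token embeddings
Q_s, K_s, V_s = X_sent @ W_Q, X_sent @ W_K, X_sent @ W_V
_, weights = scaled_dot_product_attention(Q_s, K_s, V_s)

it_row = weights[tokens.index("it")]                        # distribution over all 11 positions
for token, w in zip(tokens, it_row):
    print(f"{token:>8}: {w:.3f}")
```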
While powerful for capturing relationships within a sequence, relying on a single set of projections ($W^Q$, $W^K$, $W^V$) might force the attention mechanism to average potentially conflicting or distinct types of relationships. It might struggle to simultaneously focus on, for example, syntactic dependencies and semantic similarities using only one attention calculation. This limitation motivates the development of Multi-Head Attention, where we perform self-attention multiple times in parallel with different learned projections.