Building upon the Query (Q), Key (K), and Value (V) abstraction, we can now define the core computational step of one of the most common attention mechanisms: dot-product attention. At its heart, attention calculates a set of scores representing the relevance or compatibility between each query and every key. These scores are then used to create a weighted sum of the values, effectively allowing the model to focus on the most pertinent information carried by the values based on query-key interactions.
The "dot-product" name comes directly from how these compatibility scores are calculated. For a given query vector q and a set of key vectors K={k1,k2,...,km}, the score between q and a specific key kj is simply their dot product:
score(q,kj)=q⋅kj=qTkj
Intuitively, the dot product measures the alignment or similarity between two vectors. If a query vector $q$ is highly similar to (points in a similar direction as) a key vector $k_j$, their dot product will be large. Conversely, if the two vectors are orthogonal, the dot product will be zero, and if they point in opposing directions, it will be negative.
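To make this concrete, here is a small NumPy sketch (the vectors are chosen arbitrarily for illustration) that compares a query's dot product with an aligned key, an orthogonal key, and an opposing key:

```python
import numpy as np

q = np.array([1.0, 2.0, 0.5])              # example query vector
k_aligned = np.array([2.0, 4.0, 1.0])      # points in the same direction as q
k_orthogonal = np.array([2.0, -1.0, 0.0])  # perpendicular to q
k_opposing = -q                            # points in the opposite direction

for name, k in [("aligned", k_aligned),
                ("orthogonal", k_orthogonal),
                ("opposing", k_opposing)]:
    print(f"{name:>10}: {np.dot(q, k):+.2f}")
# aligned: +10.50, orthogonal: +0.00, opposing: -5.25
```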
In practice, we don't compute these scores one at a time. Deep learning frameworks excel at parallel processing, especially matrix multiplication, so we typically work with matrices of queries, keys, and values:

- A query matrix $Q$ of shape $[n \times d_k]$, holding $n$ query vectors as rows.
- A key matrix $K$ of shape $[m \times d_k]$, holding $m$ key vectors as rows.
- A value matrix $V$ of shape $[m \times d_v]$, holding $m$ value vectors as rows.

Note that the query and key vectors must have the same dimension ($d_k$) for the dot product to be defined. The value vectors can have a different dimension ($d_v$).
To compute all query-key scores simultaneously, we perform a matrix multiplication between the query matrix $Q$ and the transpose of the key matrix, $K^\top$:

$$\text{Scores} = QK^\top$$
Let's examine the dimensions: multiplying $Q$ (shape $[n \times d_k]$) by $K^\top$ (shape $[d_k \times m]$) results in a Scores matrix of shape $[n \times m]$. Each element $(i, j)$ in this Scores matrix is the dot product between the $i$-th query (row $i$ of $Q$) and the $j$-th key (row $j$ of $K$), indicating their compatibility.
Matrix multiplication $QK^\top$ computes all pairwise dot products between query vectors (rows of $Q$) and key vectors (columns of $K^\top$).
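The following NumPy sketch illustrates this batched computation; the sizes $n$, $m$, and $d_k$ are arbitrary choices for the example, and the matrices are filled with random values rather than learned projections:

```python
import numpy as np

n, m, d_k = 4, 6, 8              # 4 queries, 6 keys, query/key dimension 8
rng = np.random.default_rng(0)

Q = rng.normal(size=(n, d_k))    # query matrix, shape [n, d_k]
K = rng.normal(size=(m, d_k))    # key matrix, shape [m, d_k]

scores = Q @ K.T                 # all pairwise dot products, shape [n, m]
print(scores.shape)              # (4, 6)

# scores[i, j] is the dot product of the i-th query with the j-th key
assert np.isclose(scores[1, 2], np.dot(Q[1], K[2]))
```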
These raw scores ($QK^\top$) are the fundamental compatibility measure in dot-product attention, but they are not yet ready to be used as weights for the values. The dot products can take on a large range of values, which can lead to problematic gradients during training. Furthermore, we need to convert the scores for each query into a probability distribution that sums to 1, representing how much attention that query should pay to each value.
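As a rough illustration of how the scale of the raw scores grows with the key dimension, the sketch below (with arbitrarily chosen dimensions and random unit-variance vectors) compares the spread of dot products at a small and a large $d_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare the spread of raw dot-product scores for two key dimensions.
for d_k in (4, 256):
    q = rng.normal(size=(10_000, d_k))   # random unit-variance query vectors
    k = rng.normal(size=(10_000, d_k))   # random unit-variance key vectors
    dots = np.sum(q * k, axis=1)         # one dot product per (query, key) pair
    print(f"d_k = {d_k:>3}: score std ≈ {dots.std():.2f}")
# The spread grows roughly like sqrt(d_k): larger key dimensions
# produce larger raw scores before any scaling is applied.
```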
The next steps, scaling these scores and applying the softmax function, address these issues and transform the raw dot-product scores into usable attention weights. We will examine these steps in the following sections.