Following the calculation of scaled dot-product scores between Queries (Q) and Keys (K), we obtain a matrix of raw alignment scores. As discussed previously, the computation yields $\frac{QK^T}{\sqrt{d_k}}$. While these scores reflect the compatibility between query and key vectors, they are unnormalized and can span any range of values, making them difficult to interpret directly as contribution weights.
To transform these raw scores into a usable set of weights that represent a distribution of attention, we apply the softmax function independently to each row of the score matrix. For a specific query $q_i$ (corresponding to the i-th row of Q), the raw score for its alignment with key $k_j$ (corresponding to the j-th column of $K^T$) is denoted as $s_{ij} = \frac{q_i k_j^T}{\sqrt{d_k}}$. The softmax function converts the vector of scores $s_i = [s_{i1}, s_{i2}, \dots, s_{iN}]$ for query $q_i$ across all N keys into a vector of attention weights $\alpha_i = [\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{iN}]$, where each weight $\alpha_{ij}$ is calculated as:
$$\alpha_{ij} = \mathrm{softmax}(s_{ij}) = \frac{\exp(s_{ij})}{\sum_{l=1}^{N} \exp(s_{il})}$$

Here, N represents the sequence length of the Key/Value pairs.
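As a concrete reference, this row-wise normalization can be written in a few lines of NumPy. The sketch below is illustrative (the helper name `softmax_rows` is not a library function); it also subtracts each row's maximum score before exponentiating, a standard numerical-stability trick that leaves the result unchanged because softmax is invariant to adding a constant to every score in a row.

```python
import numpy as np

def softmax_rows(scores):
    """Apply softmax independently to each row of a score matrix."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Each row holds the scores s_i for one query against all N keys.
raw_scores = np.array([[1.0, 0.5, 2.5, -0.1],
                       [0.2, 0.2, 0.2, 0.2]])
weights = softmax_rows(raw_scores)
print(weights)               # each row is a distribution over the N keys
print(weights.sum(axis=-1))  # every row sums to 1
```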
Applying the softmax function gives the resulting attention weights $\alpha_{ij}$ several important properties: every weight is strictly positive, the weights for a given query sum to exactly 1, and differences between raw scores are amplified exponentially, so the largest scores dominate. Each row of weights can therefore be read as a probability distribution over the input positions.
These calculated attention weights $\alpha_{ij}$ are the coefficients used to compute a weighted sum of the Value vectors (V). The attention mechanism's output for the i-th query is obtained by multiplying the attention weight distribution $\alpha_i$ with the Value matrix V:
$$\text{output}_i = \sum_{j=1}^{N} \alpha_{ij} v_j$$

In matrix notation, this corresponds directly to the final step in the scaled dot-product attention formula:
$$\mathrm{Attention}(Q, K, V) = AV = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where A is the attention weight matrix whose elements are $A_{ij} = \alpha_{ij}$. The softmax function is thus fundamental for converting the raw similarity scores into a normalized distribution that dictates how information from different parts of the input sequence (represented by the Value vectors) is aggregated to form the output.
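To see how the normalized weights feed into the weighted sum of Values, here is a minimal NumPy sketch of the full scaled dot-product attention computation. The function name, shapes, and random inputs are illustrative assumptions; masking, batching, and multi-head projections are deliberately omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (num_queries, d_k), K: (N, d_k), V: (N, d_v).
    Returns the (num_queries, d_v) output and the attention weight matrix A.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # raw alignment scores, shape (num_queries, N)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax: attention weights
    return A @ V, A                               # weighted sum of Value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))    # 2 queries, d_k = 8
K = rng.normal(size=(5, 8))    # N = 5 keys
V = rng.normal(size=(5, 16))   # N = 5 values, d_v = 16
output, A = scaled_dot_product_attention(Q, K, V)
print(output.shape)    # (2, 16)
print(A.sum(axis=-1))  # each row of A sums to 1
```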
Consider a simplified case with 4 raw alignment scores for a single query: $s = [1.0, 0.5, 2.5, -0.1]$. Applying the softmax function transforms these scores:
The sum of exponentials is $\exp(1.0) + \exp(0.5) + \exp(2.5) + \exp(-0.1) \approx 2.718 + 1.649 + 12.182 + 0.905 = 17.454$.
The resulting softmax weights, each obtained by dividing the corresponding exponential by 17.454, are $\alpha \approx [0.156, 0.094, 0.698, 0.052]$.
Notice how the highest raw score (2.5) corresponds to the dominant attention weight (0.698), effectively focusing the attention mechanism on the third input element in this example. The weights sum to approximately 1 ($0.156 + 0.094 + 0.698 + 0.052 = 1.000$).
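The arithmetic in this example can be checked directly with NumPy (a quick verification sketch, not part of the mechanism itself):

```python
import numpy as np

s = np.array([1.0, 0.5, 2.5, -0.1])   # raw alignment scores for one query
exp_s = np.exp(s)
print(np.round(exp_s, 3))             # [ 2.718  1.649 12.182  0.905]
print(round(exp_s.sum(), 3))          # 17.454
weights = exp_s / exp_s.sum()
print(np.round(weights, 3))           # [0.156 0.094 0.698 0.052]
print(round(weights.sum(), 3))        # 1.0
```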
Comparison of raw alignment scores and the resulting attention weights after applying the softmax function for a single query. Softmax converts scores into a probability distribution, highlighting the most relevant positions.
In summary, the softmax function acts as a crucial normalization step within the attention mechanism. It transforms raw alignment scores into a probability distribution, enabling the model to selectively weight and combine information from the Value vectors based on Query-Key similarities, forming the foundation for how Transformers process sequential information.