The calculation of scaled dot-product scores between Queries (Q) and Keys (K) produces a matrix of raw alignment scores. This computation yields $\frac{QK^T}{\sqrt{d_k}}$. Although these scores reflect the compatibility between query and key vectors, they are unnormalized and can span any range of values, making them difficult to interpret directly as contribution weights.
To transform these raw scores into a usable set of weights that represent a distribution of attention, we apply the softmax function independently to each row of the score matrix. For a specific query $q_i$ (corresponding to the $i$-th row of $Q$), the raw score for its alignment with key $k_j$ (corresponding to the $j$-th column of $K^T$) is denoted as $s_{ij} = \frac{q_i k_j^T}{\sqrt{d_k}}$. The softmax function converts the vector of scores $s_i = [s_{i1}, s_{i2}, \ldots, s_{iN}]$ for query $q_i$ across all $N$ keys into a vector of attention weights $\alpha_i = [\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iN}]$, where each weight $\alpha_{ij}$ is calculated as:
$$\alpha_{ij} = \mathrm{softmax}(s_{ij}) = \frac{\exp(s_{ij})}{\sum_{l=1}^{N} \exp(s_{il})}$$
Here, $N$ represents the sequence length of the Key/Value pairs.
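The row-wise softmax described above can be sketched in a few lines of NumPy. The `softmax` helper and the example scores below are illustrative, not part of the original formulation; subtracting the row maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged, since softmax is invariant to adding a constant to every score in a row.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: converts raw alignment scores into attention weights."""
    # Shift by the row max for numerical stability (does not change the result).
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Illustrative scores for 2 queries over N = 3 keys.
S = np.array([[1.0, 0.5, 2.5],
              [0.2, 0.2, 0.2]])
A = softmax(S)
print(A.sum(axis=-1))  # each row sums to 1
```

Note that the second row of `S` has identical scores, so its attention weights come out uniform, as the formula predicts.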
Properties and Interpretation
Applying the softmax function yields several important properties for the resulting attention weights $\alpha_{ij}$:
Normalization: The denominator $\sum_{l=1}^{N} \exp(s_{il})$ ensures that the sum of all attention weights for a given query $q_i$ across all keys equals 1. That is, $\sum_{j=1}^{N} \alpha_{ij} = 1$. This allows us to interpret the weights as a probability distribution.
Non-negativity: Since the exponential function $\exp(x)$ is strictly positive for any real input $x$, each individual attention weight $\alpha_{ij}$ is guaranteed to be positive.
Probabilistic Interpretation: The set of weights $\{\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iN}\}$ for a query $q_i$ represents the probability distribution of attention over the input sequence. The weight $\alpha_{ij}$ indicates the proportion of attention the model assigns to the $j$-th input element (represented by its value vector $v_j$) when computing the output representation for the $i$-th position.
Highlighting Importance: The exponential function inherently amplifies larger scores more than smaller ones. If one score $s_{ik}$ is significantly larger than the others in the row $s_i$, its corresponding weight $\alpha_{ik}$ will be close to 1, while the others will be close to 0. This mechanism allows the model to sharply focus on the most relevant input elements identified by the dot-product scoring.
Role in Computing the Final Output
These calculated attention weights $\alpha_{ij}$ are the coefficients used to compute a weighted sum of the Value vectors (V). The attention mechanism's output for the $i$-th query is obtained by multiplying the attention weight distribution $\alpha_i$ with the Value matrix $V$:
$$\mathrm{output}_i = \sum_{j=1}^{N} \alpha_{ij} v_j$$
In matrix notation, this corresponds directly to the final step in the scaled dot-product attention formula:
$$\mathrm{Attention}(Q, K, V) = AV = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $A$ is the attention weight matrix whose elements are $A_{ij} = \alpha_{ij}$. The softmax function is thus fundamental for converting the raw similarity scores into a normalized distribution that dictates how information from different parts of the input sequence (represented by the Value vectors) is aggregated to form the output.
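Putting the pieces together, the full scaled dot-product attention computation can be sketched in NumPy as follows. The function name, shapes, and random inputs are illustrative choices, not prescribed by the source; the logic follows the formula above directly: scale the scores by $\sqrt{d_k}$, apply a row-wise softmax, then take the weighted sum of the Value vectors.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Returns (output, attention weights) per the scaled dot-product formula."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # raw alignment scores, shape (M, N)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    A = np.exp(shifted)
    A = A / A.sum(axis=-1, keepdims=True)     # row-wise softmax -> weights
    return A @ V, A                           # weighted sum of Value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # N = 3 keys
V = rng.normal(size=(3, 4))   # N = 3 value vectors
out, A = scaled_dot_product_attention(Q, K, V)
print(out.shape, A.shape)     # output has one row per query; A is (2, 3)
```

Each row of `A` is the attention distribution $\alpha_i$ for the corresponding query, and each row of `out` is the weighted combination $\sum_j \alpha_{ij} v_j$.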
Visualization Example
Consider a simplified case with 4 raw alignment scores for a single query: $s = [1.0, 0.5, 2.5, -0.1]$. Applying the softmax function transforms these scores:
$\exp(1.0) \approx 2.718$
$\exp(0.5) \approx 1.649$
$\exp(2.5) \approx 12.182$
$\exp(-0.1) \approx 0.905$
The sum of exponentials is $2.718 + 1.649 + 12.182 + 0.905 = 17.454$.
The resulting softmax weights are:
$\alpha_1 = 2.718 / 17.454 \approx 0.156$
$\alpha_2 = 1.649 / 17.454 \approx 0.094$
$\alpha_3 = 12.182 / 17.454 \approx 0.698$
$\alpha_4 = 0.905 / 17.454 \approx 0.052$
Notice how the highest raw score ($2.5$) corresponds to the dominant attention weight ($\approx 0.698$), effectively focusing the attention mechanism on the third input element in this example. The weights sum to approximately 1 ($0.156 + 0.094 + 0.698 + 0.052 = 1.000$).
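The worked example above can be reproduced directly in code, which is a useful sanity check when first implementing softmax. This sketch uses the same score vector as the example:

```python
import numpy as np

s = np.array([1.0, 0.5, 2.5, -0.1])
e = np.exp(s)           # e ≈ [2.718, 1.649, 12.182, 0.905]
alpha = e / e.sum()     # divide by the sum of exponentials (≈ 17.454)
print(np.round(alpha, 3))  # matches the hand-computed weights
```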
Figure: Comparison of raw alignment scores and the resulting attention weights after applying the softmax function for a single query. Softmax converts scores into a probability distribution, highlighting the most relevant positions.
In summary, the softmax function acts as an essential normalization step within the attention mechanism. It transforms raw alignment scores into a probability distribution, enabling the model to selectively weight and combine information from the Value vectors based on Query-Key similarities, forming the foundation for how Transformers process sequential information.
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.). DOI: 10.48550/arXiv.1706.03762. Original paper introducing the Transformer architecture and the scaled dot-product attention mechanism, including the role of softmax.
Deep Learning. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press). Foundational textbook providing a detailed mathematical explanation of the softmax function and its properties.