Building upon the concept of using dot products between Query (Q) and Key (K) vectors to measure alignment, we arrive at a refinement critical for stable and effective training: Scaled Dot-Product Attention.
The core dot-product attention mechanism computes $QK^T$. Consider the dimensions involved. If the query and key vectors have dimension $d_k$, the dot product between a single query $q$ and a single key $k$ is $\sum_{i=1}^{d_k} q_i k_i$. If we assume the components of $q$ and $k$ are independent random variables with zero mean and unit variance, the expected value of the dot product is 0, but its variance is $d_k$.
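To see this variance growth concretely, the following short sketch (assuming NumPy, with an arbitrary sample size chosen purely for illustration) samples many query/key pairs with standard-normal components and checks that the dot products indeed have mean near 0 and variance near $d_k$:

```python
# Empirical check: the variance of q . k grows with d_k when components
# are i.i.d. with zero mean and unit variance.
import numpy as np

rng = np.random.default_rng(0)

for d_k in (8, 64, 512):
    # Sample many independent query/key pairs with standard-normal components.
    q = rng.standard_normal((100_000, d_k))
    k = rng.standard_normal((100_000, d_k))
    dots = np.sum(q * k, axis=1)          # one dot product per pair
    print(f"d_k={d_k:4d}  mean={dots.mean():+.3f}  var={dots.var():.1f}")
# Expected: mean near 0, variance close to 8, 64, and 512 respectively.
```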
In practice, especially for large values of $d_k$ (e.g., 64, 128, or higher, common in Transformers), the magnitude of these dot products can become significantly large. Why is this problematic? These dot products are the inputs to the softmax function, which calculates the attention weights.
$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
The softmax function exhibits saturation behavior. If its inputs (the dot-product scores) have large magnitudes (either large positive or large negative), the function enters regions where its gradient is extremely close to zero: the output approaches a one-hot distribution, and the entries of the softmax Jacobian shrink toward zero.

The gradient of the softmax is largest when its inputs are small and close together, and it diminishes rapidly as the inputs drift toward large positive or negative values. Large dot products push the computation into these low-gradient regions.
Small gradients drastically slow down or even halt the learning process during backpropagation, as weight updates become vanishingly small. This makes training deep models very difficult.
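The following illustration (again assuming NumPy; the specific logits and scale factors are arbitrary choices, not values from the text) makes the saturation concrete: as the magnitude of the inputs grows, the softmax output approaches a one-hot vector and the entries of its Jacobian, which determine how much gradient flows back through the attention weights, collapse toward zero.

```python
# Illustration: large-magnitude logits push softmax toward a one-hot output,
# and the entries of its Jacobian (hence the backpropagated gradients) vanish.
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(p):
    # d softmax_i / d z_j = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)

logits = np.array([1.2, -0.4, 0.7, 0.1])   # arbitrary example scores

for scale in (1.0, 10.0, 30.0):            # mimic growing dot-product magnitudes
    p = softmax(scale * logits)
    J = softmax_jacobian(p)
    print(f"scale={scale:5.1f}  max prob={p.max():.4f}  "
          f"max |dsoftmax/dz|={np.abs(J).max():.2e}")
# As the scale grows, the largest probability approaches 1 and the largest
# Jacobian entry approaches 0.
```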
To counteract this effect, the "Attention Is All You Need" paper introduced a scaling factor of $1/\sqrt{d_k}$. The complete formula for Scaled Dot-Product Attention is:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Here, $Q$, $K$, and $V$ are matrices representing the packed sets of queries, keys, and values, respectively. $d_k$ is the dimension of the key (and query) vectors.
By dividing the dot products $QK^T$ by $\sqrt{d_k}$, we effectively moderate their magnitude, keeping the inputs to the softmax function in a regime where gradients remain substantial. Since the raw dot products have variance $d_k$, dividing by $\sqrt{d_k}$ keeps the variance of the softmax inputs approximately constant (close to 1) regardless of the choice of $d_k$, contributing significantly to more stable training dynamics. It also prevents the attention weights from becoming overly peaky (concentrated on a single input) or overly diffuse too early in training simply due to the dimensionality of the representations.
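For reference, here is a minimal sketch of this formula in NumPy; the function name, the tensor shapes, and the tiny usage example are illustrative assumptions rather than a prescribed implementation.

```python
# Minimal scaled dot-product attention.
# Assumed shapes: Q is (n_q, d_k), K is (n_k, d_k), V is (n_k, d_v).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # (n_q, d_v), (n_q, n_k)

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))   # 3 queries, d_k = 64
K = rng.standard_normal((5, 64))   # 5 keys
V = rng.standard_normal((5, 32))   # 5 values, d_v = 32
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)                   # (3, 32) (3, 5)
print(weights.sum(axis=-1))                          # each row sums to 1
```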
This scaling is not just a minor implementation detail; it's a crucial component that enables the successful training of deep Transformer models by mitigating potential gradient issues stemming directly from the core attention calculation. It allows the model to learn effectively even when using high-dimensional vector representations for queries and keys.