The previous section defined the mathematical structure of sinusoidal positional encodings:
$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
Let's examine why this particular formulation is effective and frequently used. These fixed, non-learned encodings possess several desirable properties that integrate well with the Transformer architecture.
Unlike learned positional embeddings, which require training, sinusoidal encodings are generated by a fixed function. This makes them deterministic: for a given position $pos$ and dimension index $i$, the value is always the same. The scheme also requires no additional trainable parameters for position information.
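To make this concrete, here is a minimal NumPy sketch of such a fixed encoding function. The function name, argument names, and sizes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of fixed sinusoidal positional encodings.

    Assumes d_model is even so that sine/cosine dimensions pair up cleanly.
    """
    positions = np.arange(max_len)[:, np.newaxis]        # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]      # even dimension indices 2i
    angular_freqs = 1.0 / 10000.0 ** (two_i / d_model)   # omega_i = 1 / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * angular_freqs)      # even dims: sin(pos * omega_i)
    pe[:, 1::2] = np.cos(positions * angular_freqs)      # odd dims:  cos(pos * omega_i)
    return pe

# Deterministic: repeated calls produce identical values, with no trainable parameters.
pe = sinusoidal_positional_encoding(max_len=50, d_model=128)
assert np.array_equal(pe, sinusoidal_positional_encoding(max_len=50, d_model=128))
```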
Furthermore, the combination of sine and cosine functions across different frequencies (determined by the $10000^{2i/d_{model}}$ term) ensures that each position $pos$ within a reasonable sequence length receives a unique encoding vector $PE_{pos} \in \mathbb{R}^{d_{model}}$. While collisions are theoretically possible for extremely long sequences, they are practically nonexistent within typical model constraints.
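As a quick numerical sanity check of this uniqueness claim (the sequence length of 4096 and the rounding tolerance below are arbitrary illustrative choices):

```python
import numpy as np

d_model, max_len = 128, 4096
pos = np.arange(max_len)[:, None]
freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(pos * freqs)
pe[:, 1::2] = np.cos(pos * freqs)

# Rounding first shows the rows differ by more than mere floating-point noise.
unique_rows = np.unique(pe.round(decimals=8), axis=0)
print(unique_rows.shape[0] == max_len)  # True: every position has a distinct encoding
```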
The sine and cosine functions naturally produce values within a fixed range, $[-1, 1]$. When these positional encodings are added to the token embeddings (which are also typically within a managed range, often through normalization or initialization), this boundedness prevents the positional information from drastically altering the magnitude of the combined embedding. This contributes to more stable training dynamics compared to potentially unbounded positional signals.
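A short sketch of this combination step, with random values standing in for real token embeddings (the 0.02 scale and the sequence length are arbitrary illustrative choices):

```python
import numpy as np

d_model, seq_len = 128, 50
pos = np.arange(seq_len)[:, None]
freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(pos * freqs)
pe[:, 1::2] = np.cos(pos * freqs)

# Every encoding value lies in [-1, 1].
print(pe.min() >= -1.0 and pe.max() <= 1.0)  # True

# Adding the bounded encodings to (placeholder) token embeddings shifts each value
# by at most 1, so the combined representation stays on a similar scale.
token_embeddings = np.random.default_rng(0).normal(scale=0.02, size=(seq_len, d_model))
x = token_embeddings + pe
```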
Perhaps the most significant advantage of sinusoidal encodings is their inherent ability to represent relative positions through linear transformations. Consider the encoding for position $pos + k$. Using trigonometric sum identities:
$$
\sin(a+b) = \sin(a)\cos(b) + \cos(a)\sin(b), \qquad
\cos(a+b) = \cos(a)\cos(b) - \sin(a)\sin(b)
$$
Let $\omega_i = 1/10000^{2i/d_{model}}$. The components of $PE_{pos+k}$ can be written in terms of $PE_{pos}$:
$$
\begin{aligned}
PE_{(pos+k,\,2i)} &= \sin((pos+k)\,\omega_i) = \sin(pos\,\omega_i)\cos(k\,\omega_i) + \cos(pos\,\omega_i)\sin(k\,\omega_i) \\
&= PE_{(pos,\,2i)}\cos(k\,\omega_i) + PE_{(pos,\,2i+1)}\sin(k\,\omega_i)
\end{aligned}
$$

$$
\begin{aligned}
PE_{(pos+k,\,2i+1)} &= \cos((pos+k)\,\omega_i) = \cos(pos\,\omega_i)\cos(k\,\omega_i) - \sin(pos\,\omega_i)\sin(k\,\omega_i) \\
&= PE_{(pos,\,2i+1)}\cos(k\,\omega_i) - PE_{(pos,\,2i)}\sin(k\,\omega_i)
\end{aligned}
$$
This can be expressed as a matrix multiplication for each pair of dimensions $(2i, 2i+1)$:

$$
\begin{pmatrix} PE_{(pos+k,\,2i)} \\ PE_{(pos+k,\,2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(k\,\omega_i) & \sin(k\,\omega_i) \\ -\sin(k\,\omega_i) & \cos(k\,\omega_i) \end{pmatrix}
\begin{pmatrix} PE_{(pos,\,2i)} \\ PE_{(pos,\,2i+1)} \end{pmatrix}
$$

This demonstrates that the positional encoding $PE_{pos+k}$ is a linear function (specifically, a rotation) of $PE_{pos}$. The transformation matrix depends only on the offset $k$, not on the absolute position $pos$. This property makes it easier for the self-attention mechanism, which involves linear projections (Queries, Keys, Values) and dot products, to learn to attend based on relative distances between tokens. The model doesn't need to learn separate rules for an offset of +2 at position 5 versus position 50; the relationship is encoded consistently.
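This rotation relationship is easy to check numerically. The sketch below picks an arbitrary position, offset, and dimension-pair index (values chosen purely for illustration) and confirms that the offset-dependent rotation of $PE_{pos}$ reproduces $PE_{pos+k}$:

```python
import numpy as np

d_model = 128
pos, k, i = 7, 5, 3                         # arbitrary position, offset, and pair index
omega = 1.0 / 10000.0 ** (2 * i / d_model)  # omega_i for this dimension pair

pe_pos = np.array([np.sin(pos * omega), np.cos(pos * omega)])                # (PE_(pos,2i), PE_(pos,2i+1))
pe_pos_k = np.array([np.sin((pos + k) * omega), np.cos((pos + k) * omega)])  # same pair at pos + k

# Rotation matrix that depends only on the offset k, not on pos.
rotation = np.array([[ np.cos(k * omega), np.sin(k * omega)],
                     [-np.sin(k * omega), np.cos(k * omega)]])

print(np.allclose(rotation @ pe_pos, pe_pos_k))  # True
```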
The sinusoidal functions vary smoothly with position. This means that the positional encodings for adjacent positions $pos$ and $pos + 1$ are similar, reflecting the intuition that adjacent words often have closely related contextual roles. This smooth change contrasts with potentially abrupt changes that could occur with less structured encoding schemes.
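For instance, comparing encoding vectors directly (positions 10, 11, and 60 here are arbitrary picks) shows that neighbouring positions sit much closer together than distant ones:

```python
import numpy as np

d_model, max_len = 128, 100
pos = np.arange(max_len)[:, None]
freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(pos * freqs)
pe[:, 1::2] = np.cos(pos * freqs)

# Adjacent positions have similar encodings; distant positions differ much more.
print(np.linalg.norm(pe[10] - pe[11]))  # small distance between neighbours
print(np.linalg.norm(pe[10] - pe[60]))  # noticeably larger for far-apart positions
```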
Because sinusoidal encodings are generated by a fixed function rather than learned from data within a fixed sequence length range, they offer a theoretical advantage for handling sequences longer than those encountered during training. The function can generate encodings for any position pos. While the model's overall performance might still degrade on much longer sequences due to other factors (like attention patterns not generalizing), the positional encoding mechanism itself doesn't inherently break down or produce undefined values for unseen positions, unlike learned embeddings which would lack representations for positions beyond the training maximum.
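The sketch below illustrates this: the training-length limit of 512 is a hypothetical value chosen for illustration, and the same fixed function is simply evaluated at a far larger position:

```python
import numpy as np

def pe_at(position: int, d_model: int = 128) -> np.ndarray:
    """Sinusoidal encoding for a single position; defined for any non-negative integer."""
    freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
    out = np.zeros(d_model)
    out[0::2] = np.sin(position * freqs)
    out[1::2] = np.cos(position * freqs)
    return out

train_max = 512             # hypothetical maximum length seen during training
far_beyond = pe_at(10_000)  # still well-defined and bounded for position 10,000
print(far_beyond.min() >= -1.0 and far_beyond.max() <= 1.0)  # True
# A learned embedding table of size train_max would have no row for position 10_000.
```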
The choice of frequencies ($\omega_i = 1/10000^{2i/d_{model}}$) results in signals that range from high frequency (for small $i$, varying rapidly with position) to very low frequency (for large $i$, varying slowly across the sequence). This allows the model to potentially capture positional information at different resolutions.
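Concretely, the wavelength of dimension pair $i$ is $2\pi \cdot 10000^{2i/d_{model}}$, so the wavelengths grow geometrically from $2\pi$ up to nearly $10000 \cdot 2\pi$. A quick calculation (with $d_{model} = 128$, matching the figure below) makes the spread visible:

```python
import numpy as np

d_model = 128
i = np.arange(d_model // 2)
wavelengths = 2 * np.pi * 10000.0 ** (2 * i / d_model)  # wavelength of dimension pair i

print(wavelengths[0])    # ~6.28: highest-frequency pair, repeats every ~6 positions
print(wavelengths[-1])   # ~5.4e4: lowest-frequency pair, varies very slowly
```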
Figure: heatmap of sinusoidal positional encoding values for the first 30 positions ($pos$) and the first 6 dimensions ($d = 0$ to $d = 5$) of a 128-dimensional embedding ($d_{model} = 128$). Notice the faster oscillation in lower dimensions (top rows) and slower oscillation in higher dimensions (bottom rows).
In summary, sinusoidal positional encodings provide a simple yet effective parameter-free method to inject sequence order information into the Transformer. Their mathematical properties align well with the attention mechanism's ability to model relationships between tokens, particularly concerning relative positioning, while maintaining stability and offering potential for generalization to longer sequences.