As established previously, the core self-attention mechanism within a Transformer doesn't inherently understand the order of tokens in a sequence. Self-attention treats its input as an unordered set: if you shuffled the input words, each word's attention output (before positional information is added) would be exactly the same, just produced in a different order, so the model has no built-in way to know the order changed. This is unlike Recurrent Neural Networks (RNNs), which process sequences step by step and therefore incorporate order inherently.
To address this, Transformers inject information about the position of each token directly into the input embeddings. This is achieved using Positional Encoding. The idea is to create a unique vector for each position in the sequence and add this vector to the corresponding token's embedding. This combined embedding thus carries both the semantic meaning of the token and its position within the sequence.
While several methods for positional encoding exist, the original Transformer paper introduced a widely used technique based on sine and cosine functions of different frequencies. For a token at position $pos$ in the sequence, and for each pair of dimensions $(2i, 2i+1)$ within the embedding vector of size $d_{model}$, the positional encoding $PE$ is calculated as follows:
$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

Let's break this down:
The core idea is that the wavelength of the sine/cosine waves varies across the dimensions. For small values of $i$ (the initial dimensions), the wavelength is short (high frequency). For large values of $i$ (the later dimensions), the wavelength becomes very long (low frequency, approaching a constant value for short sequences). Each position $pos$ thus gets a unique combination of sinusoidal values across the $d_{model}$ dimensions.
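To make the formula concrete, here is a minimal NumPy sketch that builds the full encoding matrix for a short sequence. The function name and shapes are illustrative, not a specific library API:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    # Assumes d_model is even, as is standard for Transformer models.
    positions = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]    # the 2i indices
    angle_rates = 1.0 / np.power(10000.0, even_dims / d_model)
    angles = positions * angle_rates                       # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=10, d_model=8)
print(pe.shape)   # (10, 8)
print(pe[0])      # position 0: [0., 1., 0., 1., ...]
```

Reading down a single column of this matrix traces one sinusoid sampled at successive positions: the leftmost columns oscillate quickly, while later columns change slowly, which is exactly the frequency pattern described above.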
The resulting positional encoding vector $PE(pos)$ has the same dimension ($d_{model}$) as the input embeddings. This positional encoding vector is then simply added element-wise to the token's input embedding:
final_embedding(pos) = input_embedding(pos) + PE(pos)

This addition allows the model to learn representations that combine both the token's meaning and its position.
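As a quick illustration of that addition, the sketch below adds the matrix from the earlier helper to a placeholder batch of token embeddings (random values stand in for a real embedding layer):

```python
# Reusing sinusoidal_positional_encoding() from the sketch above.
seq_len, d_model = 10, 8
rng = np.random.default_rng(0)
input_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for learned token embeddings

pe = sinusoidal_positional_encoding(max_len=seq_len, d_model=d_model)
final_embeddings = input_embeddings + pe                # element-wise, shapes match exactly
print(final_embeddings.shape)                           # (10, 8)
```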
This specific sinusoidal approach has several appealing properties:

- It is deterministic and requires no learned parameters, so it adds nothing to the model's parameter count.
- Every position receives a unique encoding, and all values are bounded between -1 and 1.
- Because the functions are defined for any position, the model can, in principle, handle sequences longer than any seen during training.
- For any fixed offset $k$, $PE(pos+k)$ can be expressed as a linear transformation of $PE(pos)$, which may make it easier for the model to attend to relative positions (see the identities sketched below).
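The last property follows from the standard angle-addition identities. Writing $\omega_i = 1/10000^{2i/d_{model}}$ for the frequency of dimension pair $i$ (a symbol introduced here for convenience), a sketch of the argument is:

```latex
% For a fixed offset k, the sine/cosine pair at position pos + k is a
% rotation of the pair at position pos; the rotation depends only on k:
\begin{aligned}
\sin\big(\omega_i (pos + k)\big) &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
\cos\big(\omega_i (pos + k)\big) &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k)
\end{aligned}
```

Since $\cos(\omega_i k)$ and $\sin(\omega_i k)$ are constants for a given offset $k$, the encoding at position $pos+k$ is a fixed linear transformation of the encoding at position $pos$, independent of $pos$ itself.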
Here's a visualization of these positional encoding values across different positions and dimensions:
{"data":[{"type":"heatmap","z":[[1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0],[0.841,0.54,0.909,-0.416,0.956,-0.279,0.989,-0.141],[0.909,-0.416,0.141,-0.99,0.412,-0.911,0.657,-0.754],[0.141,-0.99,-0.757,-0.654,-0.279,-0.96,-0.0,-0.999],[-0.757,-0.654,-0.96,0.279,-0.842,0.54,-0.657,0.754],[-0.96,0.279,-0.279,0.96,-0.956,0.279,-0.989,0.141],[-0.279,0.96,0.657,0.754,-0.412,0.911,-0.0,1.0],[0.657,0.754,0.989,-0.141,0.0,-1.0,0.657,0.754],[0.989,-0.141,0.412,-0.911,0.842,-0.54,0.96,-0.279],[0.412,-0.911,-0.54,-0.842,0.996,0.09,0.279,-0.96]],"x":["Dim 0","Dim 1","Dim 2","Dim 3","Dim 4","Dim 5","Dim 6", "Dim 7"],"y":["Pos 0","Pos 1","Pos 2","Pos 3","Pos 4","Pos 5","Pos 6","Pos 7","Pos 8","Pos 9"],"colorscale":"Blues","colorbar":{"title":"PE Value"}},],"layout":{"title":"Positional Encoding Values (First 8 Dimensions, 10 Positions)","xaxis":{"title":"Embedding Dimension Index"},"yaxis":{"title":"Sequence Position","autorange":"reversed"},"margin":{"l":60,"r":10,"t":40,"b":40}}}
Visualization of positional encoding values. Each row represents a position in the sequence, and each column represents a dimension in the encoding vector. Notice the different frequencies of the sinusoidal patterns across dimensions and the unique vector for each position.
This injection of positional information happens only once, right at the beginning, before the input embeddings are fed into the first layer of the encoder (or decoder) stack. Subsequent layers within the Transformer then process these position-aware embeddings. Without this step, the Transformer would essentially be processing a "bag of words," losing the sequential nature of language or other ordered data.
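To show where this fits in practice, here is a minimal PyTorch-style sketch, assuming the standard `nn.TransformerEncoder` building blocks. The `TransformerInput` module and all hyperparameter values are illustrative, not the original implementation:

```python
import math
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    """Token embedding plus fixed sinusoidal positional encoding, applied once."""
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Precompute the sinusoidal table and store it as a non-trainable buffer.
        position = torch.arange(max_len).unsqueeze(1)                                   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> position-aware embeddings (batch, seq_len, d_model)
        seq_len = token_ids.size(1)
        return self.embed(token_ids) + self.pe[:seq_len]

# Positional information is injected only here; the encoder layers receive
# position-aware embeddings and add no further positional signal themselves.
encoder_layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

inputs = TransformerInput(vocab_size=100, d_model=8, max_len=64)
x = inputs(torch.randint(0, 100, (1, 10)))  # (1, 10, 8)
out = encoder(x)                            # (1, 10, 8)
```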