The self-attention mechanism, as we've seen, processes all input tokens simultaneously. While powerful for capturing dependencies regardless of distance, this parallel processing comes at a cost: the standard self-attention operation is permutation-invariant. If you shuffle the input tokens, the attention outputs (before adding positional information) would simply be a shuffled version of the original outputs. It has no inherent knowledge of the sequence order. "The cat sat on the mat" and "mat the on sat cat The" would look the same to the self-attention layer itself. Clearly, for language modeling and most sequence tasks, order is fundamental. We need a way to inject information about the position of each token into the model. This is achieved through positional encoding.
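To make this concrete, here is a quick check, a minimal sketch using PyTorch's nn.MultiheadAttention (with its default zero dropout; the shapes and seed are arbitrary), showing that shuffling the input merely shuffles the output:

import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, num_heads, seq_len = 64, 4, 6
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
attn.eval()  # ensure no dropout is applied

x = torch.randn(1, seq_len, d_model)   # (batch, seq_len, d_model)
perm = torch.randperm(seq_len)         # a random reordering of the positions

with torch.no_grad():
    out, _ = attn(x, x, x)                                      # original order
    out_shuffled, _ = attn(x[:, perm], x[:, perm], x[:, perm])  # shuffled order

# The output for the shuffled input is just the original output, shuffled.
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-6))    # True

If the check prints True (up to floating-point noise), the attention layer really does treat its input as an unordered set of tokens.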
The core idea is to create a vector, the positional encoding, that represents the position of a token in the sequence. This vector is then added to the corresponding token's input embedding. This combined embedding, now containing both semantic information (from the token embedding) and positional information, is fed into the Transformer stack.
$$\text{InputEmbedding}_{\text{final}} = \text{TokenEmbedding} + \text{PositionalEncoding}$$
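In code, combining the two is a plain element-wise addition. The sketch below uses random tensors as stand-ins for real token embeddings and positional encodings; the shapes are only illustrative:

import torch

batch_size, seq_len, d_model = 2, 10, 768

# Stand-ins: in a real model these come from the token embedding layer
# and from one of the positional encoding schemes described below.
token_embeddings = torch.randn(batch_size, seq_len, d_model)
positional_encodings = torch.randn(seq_len, d_model)

# Broadcasting adds the same positional vector at each position
# for every sequence in the batch.
final_input = token_embeddings + positional_encodings
print(final_input.shape)  # torch.Size([2, 10, 768])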
There are several ways to generate these positional encoding vectors.
Perhaps the most straightforward approach is to learn the positional encodings just like we learn token embeddings. We can define a maximum sequence length, say $L_{max}$, and create an embedding matrix of size $(L_{max}, d_{model})$, where $d_{model}$ is the dimension of the model's embeddings. For a token at position $pos$ (where $0 \le pos < L_{max}$), we simply look up the $pos$-th vector in this embedding matrix and add it to the token's embedding.
In PyTorch, this can be implemented using nn.Embedding:
import torch
import torch.nn as nn
# Example parameters
max_seq_len = 512
d_model = 768
# Learned Positional Embedding Layer
positional_embedding_table = nn.Embedding(max_seq_len, d_model)
# Example usage: Get embeddings for positions 0, 1, 2, ..., seq_len-1
seq_len = 100
positions = torch.arange(
    0, seq_len, dtype=torch.long
).unsqueeze(0)  # Shape: (1, seq_len)
learned_pe = positional_embedding_table(positions)
# Shape: (1, seq_len, d_model)
print(f"Shape of learned positional embeddings: {learned_pe.shape}")
# Output: Shape of learned positional embeddings:
# torch.Size([1, 100, 768])
This method is simple and allows the model to learn the optimal way to represent positions for the specific task and data. However, it has drawbacks:
- It adds $L_{max} \times d_{model}$ trainable parameters to the model.
- It cannot represent positions beyond $L_{max}$, so the model does not generalize to sequences longer than those planned for at training time; the short check below makes this concrete.
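The second drawback is easy to demonstrate by continuing the snippet above: asking the learned table for a position at or beyond max_seq_len is an out-of-range lookup and should fail with an indexing error.

# Continuing the example above: positions >= max_seq_len cannot be encoded.
too_long = torch.arange(0, max_seq_len + 1, dtype=torch.long).unsqueeze(0)
try:
    positional_embedding_table(too_long)
except IndexError as err:
    print(f"Position {max_seq_len} is out of range: {err}")

The sinusoidal encodings described next avoid this hard limit, because they are computed from a formula rather than looked up in a table of fixed size.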
The original Transformer paper (Vaswani et al., 2017) proposed a fixed, non-learned positional encoding method using sine and cosine functions of varying frequencies. The motivation was to use a deterministic function that could allow the model to attend to relative positions easily, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. It also avoids the extra parameters of learned embeddings and might generalize better to unseen sequence lengths.
The formula for the positional encoding PE for a token at position pos and dimension index i is defined as:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Here:
- $pos$ is the position of the token in the sequence (starting from 0),
- $i$ indexes the dimension pairs, so the even dimension $2i$ receives the sine and the odd dimension $2i+1$ receives the cosine,
- $d_{model}$ is the embedding dimension of the model.
Each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000⋅2π. This choice allows the model to potentially learn to attend to relative positions, as the relative position information is encoded in the phase differences.
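As a quick sanity check of the formula (a hand computation, not tied to any particular library), consider position pos = 1 with d_model = 128:

import math

d_model = 128
pos = 1

# Dimension pair (0, 1): i = 0, so the angle is pos / 10000**(0 / d_model) = 1.0
print(math.sin(1.0))  # PE(1, 0) ≈ 0.8415
print(math.cos(1.0))  # PE(1, 1) ≈ 0.5403

# Dimension pair (2, 3): i = 1, so the angle is pos / 10000**(2 / d_model),
# a slightly smaller angle, i.e. a lower-frequency sinusoid.
angle = pos / 10000 ** (2 / d_model)
print(math.sin(angle), math.cos(angle))

These values should match row pos = 1 of the tensor produced by the implementation below.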
Let's implement this in PyTorch:
import torch
import math
import matplotlib.pyplot as plt
def get_sinusoidal_positional_encoding(seq_len, d_model):
    """Calculates the sinusoidal positional encoding."""
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(
        0, seq_len, dtype=torch.float
    ).unsqueeze(1)  # Shape: (seq_len, 1)
    # Term for calculating the frequencies
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-math.log(10000.0) / d_model)
    )  # Shape: (d_model/2)
    # Calculate sine for even indices, cosine for odd indices
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    # Add batch dimension (optional, often done later)
    # pe = pe.unsqueeze(0)  # Shape: (1, seq_len, d_model)
    return pe
# Example: Generate encoding for sequence length 100,
# model dimension 128
seq_len = 100
d_model = 128
fixed_pe = get_sinusoidal_positional_encoding(seq_len, d_model)
# Shape: (100, 128)
print(f"Shape of fixed positional embeddings: {fixed_pe.shape}")
# Output: Shape of fixed positional embeddings:
# torch.Size([100, 128])
# Visualize the first few dimensions
plt.figure(figsize=(10, 5))
# Plot dimensions 0, 2, 4, 6
for i in range(0, 8, 2):
    plt.plot(fixed_pe[:, i].numpy(), label=f'Dim {i} (sin)')
# Plot dimensions 1, 3, 5, 7
for i in range(1, 9, 2):
    plt.plot(
        fixed_pe[:, i].numpy(),
        label=f'Dim {i} (cos)',
        linestyle='--'
    )
plt.ylabel("Value")
plt.xlabel("Position")
plt.title("Sinusoidal Positional Encoding (First 8 Dimensions)")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
First 8 dimensions of the sinusoidal positional encoding plotted against position. Note how lower dimensions (smaller $i$) vary faster (higher frequency) than higher dimensions.
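The relative-position property mentioned earlier can also be checked numerically: for each sine/cosine pair, moving from position pos to pos + k is a fixed 2×2 rotation whose angle depends only on the offset k and the pair's frequency, not on pos. The snippet below is a small illustration reusing fixed_pe and d_model from the code above; the offset k = 5 is arbitrary.

k = 5  # arbitrary offset
freqs = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)  # the same frequencies as div_term inside the function

# Rotate PE(pos) by the offset-dependent angle to predict PE(pos + k)
cos_k, sin_k = torch.cos(k * freqs), torch.sin(k * freqs)
pred_sin = fixed_pe[:-k, 0::2] * cos_k + fixed_pe[:-k, 1::2] * sin_k
pred_cos = fixed_pe[:-k, 1::2] * cos_k - fixed_pe[:-k, 0::2] * sin_k

print(torch.allclose(pred_sin, fixed_pe[k:, 0::2], atol=1e-5))  # True
print(torch.allclose(pred_cos, fixed_pe[k:, 1::2], atol=1e-5))  # True

This is the sense in which $PE_{pos+k}$ is a linear function of $PE_{pos}$, which is what makes it plausible for attention to recover relative offsets from absolute sinusoidal encodings.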
While effective and widely used, especially in the original Transformer and models like BERT, sinusoidal encodings are fixed. They might not be the optimal representation for all types of sequential patterns.
Both learned and fixed sinusoidal positional encodings are common starting points.
In practice, the choice might depend on the specific application, model size, and sequence length requirements. It's also worth noting that the field has developed more advanced techniques. Absolute positional encodings, whether learned or fixed, primarily encode the position of a token relative to the start of the sequence. However, it's often the relative position between tokens that matters most for attention. Techniques like Relative Positional Encoding and Rotary Position Embedding (RoPE) directly incorporate relative distance information into the attention mechanism itself. These more advanced methods are explored in Chapter 13.
For now, understanding learned and sinusoidal encodings provides the necessary foundation for how Transformers incorporate sequence order information, overcoming the permutation invariance of the core self-attention mechanism. This injection of positional data is a simple yet essential element enabling the Transformer's success on sequential data.