As discussed in Chapter 4, the standard Transformer architecture relies on adding positional information to the input embeddings since the self-attention mechanism itself is permutation-invariant. The two most common approaches are learned absolute positional embeddings and fixed sinusoidal positional encodings. While functional, these methods have limitations, especially when dealing with very long sequences or when the precise relative positioning between tokens is significant. Understanding these drawbacks provides the motivation for exploring the more advanced techniques discussed later in this chapter.
One significant challenge arises when models encounter sequences longer than those seen during training.
Learned Absolute Embeddings: If a model is trained with learned positional embeddings for a maximum sequence length, say 512 tokens, it simply has no embeddings defined for positions beyond that limit. Using such a model on a 1024-token sequence during inference often leads to unpredictable behavior or a significant drop in performance, because the model encounters positions for which it has no representation. Techniques for extending the learned embeddings exist, but they are largely heuristic and may not perform reliably.
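As a minimal illustration of this failure mode, the sketch below builds a learned positional embedding table for 512 positions (the variable names and sizes are illustrative, not taken from any particular model) and shows that looking up positions beyond that limit simply fails:
import torch
import torch.nn as nn

max_train_len = 512  # maximum sequence length seen during training
embed_dim = 8

# Learned absolute positional embeddings: one row per position 0..511
pos_embedding = nn.Embedding(num_embeddings=max_train_len, embedding_dim=embed_dim)

# Position indices for a 1024-token sequence at inference time
long_positions = torch.arange(1024)

try:
    pos_embedding(long_positions)  # positions 512..1023 have no embedding row
except IndexError as err:
    # On CPU, PyTorch raises an IndexError for out-of-range embedding indices
    print("No embedding defined beyond the training length:", err)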
Sinusoidal Absolute Encodings: Fixed sinusoidal encodings, defined by mathematical functions (sine and cosine waves of varying frequencies), can theoretically generate encodings for any position.
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Here, $pos$ is the position and $i$ is the dimension index within the $d_{model}$-dimensional embedding. Because this is a deterministic function, we can compute $PE(pos, \cdot)$ for any $pos$. However, practical issues remain. While mathematically defined, the model itself may not have learned to effectively interpret the positional information for distances or absolute positions far outside its training distribution. The sinusoidal patterns might become less distinct or potentially alias at very large positions, making it harder for the model to differentiate between distant tokens accurately. The model's ability to generalize relies on learning patterns from the positional information it has seen, and extrapolating these patterns to much larger scales is not guaranteed.
Consider the sinusoidal values for two dimensions over a range of positions. Although each position receives a unique encoding, the patterns may become less discriminative for the model at scales larger than those it was trained on.
Example sinusoidal values for two dimensions across increasing positions. While mathematically unique, generalization depends on the model learning to interpret these potentially subtle differences at scales beyond its training data.
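To make the formula concrete, the following sketch (a minimal implementation written for this illustration, not taken from any particular library) computes sinusoidal encodings for arbitrary positions, including positions far beyond a typical training length, and prints the first two dimensions:
import torch

def sinusoidal_encoding(positions, d_model):
    # positions: 1D tensor of integer positions; returns a (len(positions), d_model) tensor
    pos = positions.float().unsqueeze(1)             # shape: (n, 1)
    i = torch.arange(0, d_model, 2).float()          # even dimension indices 0, 2, 4, ...
    angle_rates = 1.0 / (10000 ** (i / d_model))     # shape: (d_model/2,)
    angles = pos * angle_rates                       # shape: (n, d_model/2)
    pe = torch.zeros(len(positions), d_model)
    pe[:, 0::2] = torch.sin(angles)                  # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angles)                  # PE(pos, 2i+1)
    return pe

# Encodings can be computed for any position, even far beyond a training length of 512
positions = torch.tensor([0, 10, 511, 512, 4096])
pe = sinusoidal_encoding(positions, d_model=8)
print(pe[:, :2])  # first two dimensions for each position
The values are well defined at every position; the open question is whether a model trained only on positions up to 512 has learned to use them at position 4096.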
Self-attention calculates the interaction between tokens based on their query (Q), key (K), and value (V) representations. With absolute positional encodings (P), these representations are typically formed by adding the positional encoding to the token embedding (E). For positions i and j, the attention score calculation involves terms like:
$$\text{score}(i, j) \propto (E_i + P_i)^T W_Q^T W_K (E_j + P_j)$$
Here, $W_Q$ and $W_K$ are the weight matrices for queries and keys. Notice that the absolute positions $P_i$ and $P_j$ influence the score, but the relative position $i - j$ is not explicitly encoded. The model must learn to interpret the combination of $P_i$ and $P_j$ to understand the relative distance and direction between tokens $i$ and $j$. While models are capable of learning this implicitly to some extent, it might not be the most direct or efficient way to capture relationships that heavily depend on relative positioning, such as syntactic dependencies or local word interactions.
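Expanding this quadratic form makes the point explicit. The product splits into four terms:
$$(E_i + P_i)^T W_Q^T W_K (E_j + P_j) = E_i^T W_Q^T W_K E_j + E_i^T W_Q^T W_K P_j + P_i^T W_Q^T W_K E_j + P_i^T W_Q^T W_K P_j$$
The first term compares content with content, the middle two mix content with an absolute position, and the last compares the two absolute positions. Every positional contribution involves $P_i$ or $P_j$ individually; no term is a direct function of the offset $i - j$. Relative positional encoding schemes, introduced later in this chapter, rework exactly these terms so that the offset appears explicitly.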
For many linguistic phenomena, the relationship between two words depends more on their distance apart than their absolute position in the sequence. For example, the grammatical relationship between an adjective and a noun often depends on them being adjacent or nearby, regardless of whether they appear at the beginning or middle of a long document. An encoding that directly incorporates relative distance could potentially make it easier for the attention mechanism to capture these local dependencies.
Consider the following simplified PyTorch-like pseudo-code illustrating how standard attention with added absolute encodings works:
import torch
import torch.nn.functional as F
# Simplified example parameters
batch_size = 1
seq_len = 5
embed_dim = 8
# Token embeddings (random)
token_embed = torch.randn(batch_size, seq_len, embed_dim)
# Absolute positional encodings (e.g., sinusoidal or learned)
pos_enc = torch.randn(batch_size, seq_len, embed_dim) # Simplified stand-in
# Input embeddings
input_embed = token_embed + pos_enc
# Simplified Query, Key projections (Linear layers omitted for clarity)
# In reality, Q = input_embed @ W_q, K = input_embed @ W_k
query = input_embed # Shape: (batch, seq_len, embed_dim)
key = input_embed # Shape: (batch, seq_len, embed_dim)
# Calculate attention scores
# Dot product attention (simplified)
# Shape: (batch, seq_len, seq_len)
attn_scores = torch.matmul(query, key.transpose(-2, -1))
# Scale scores
scale_factor = torch.sqrt(torch.tensor(embed_dim, dtype=torch.float32))
scaled_attn_scores = attn_scores / scale_factor
# Apply softmax
attn_weights = F.softmax(scaled_attn_scores, dim=-1)
# attn_weights[b, i, j] contains the attention from position i to position j
# Note how the relative distance (i-j) is not explicitly part of the calculation,
# it's implicitly derived from the interaction of (token_embed_i + pos_enc_i)
# and (token_embed_j + pos_enc_j).
print("Attention weights shape:", attn_weights.shape)
This implicit handling requires the model to dedicate capacity to disentangling relative positioning from the absolute position signals.
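For contrast, a relative formulation starts from the pairwise offset matrix rather than from per-position vectors. The short sketch below is purely illustrative (not a full relative-attention implementation); it shows that the offset matrix for one window of a sequence is identical to that of any other window, which is exactly the invariance absolute encodings do not provide directly:
import torch

seq_len = 5
positions = torch.arange(seq_len)

# rel_dist[i, j] = i - j: the signed offset between query position i and key position j
rel_dist = positions.unsqueeze(1) - positions.unsqueeze(0)
print(rel_dist)
# tensor([[ 0, -1, -2, -3, -4],
#         [ 1,  0, -1, -2, -3],
#         [ 2,  1,  0, -1, -2],
#         [ 3,  2,  1,  0, -1],
#         [ 4,  3,  2,  1,  0]])

# The same offsets describe positions 100..104 just as well as positions 0..4
shifted = torch.arange(100, 100 + seq_len)
print(torch.equal(shifted.unsqueeze(1) - shifted.unsqueeze(0), rel_dist))  # True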
While sinusoidal encodings offer mathematical extrapolation, their fixed nature means they are not adapted to the specific characteristics of the training data or the downstream tasks. The sinusoidal form imposes a specific structure on positional relationships, which might not always be optimal. Learned embeddings offer more flexibility by adapting during training, but as discussed, they suffer from poor generalization to longer sequences. This trade-off between the extrapolation capability of fixed encodings and the adaptability of learned embeddings highlights a core limitation of absolute positional approaches.
These limitations concerning sequence length extrapolation, implicit relative position representation, and the fixed vs. learned trade-off have spurred research into alternative positional encoding methods. The following sections will introduce techniques like Relative Positional Encoding and Rotary Position Embedding (RoPE), which aim to address these shortcomings by incorporating relative position information more directly into the attention mechanism or modifying how positional information is integrated.