While absolute positional encodings, such as sinusoidal or learned embeddings, provide Transformer models with essential sequence order information, they treat each position independently. The encoding for position $i$, denoted $P_i$, is generated without direct reference to any other position $j$. This approach has limitations. For instance, it's not immediately clear how well these absolute encodings generalize if a model trained on sequences of length 512 encounters a sequence of length 1024 during inference. Furthermore, the attention mechanism inherently computes relationships between pairs of tokens (query $i$ attends to key $j$). It might be more natural if the positional information incorporated into the attention calculation explicitly reflected the relative distance or relationship between positions $i$ and $j$, rather than just their absolute locations.
This observation motivates the development of relative positional encoding schemes. The core idea is to modify the attention mechanism, or the inputs to it, so that the relative position difference, typically $i - j$, influences the attention score between token $i$ and token $j$. Instead of just adding $P_i$ to the input embedding of token $i$, relative schemes aim to inject information about the offset between the query and key positions directly into their interaction calculation.
Consider the standard scaled dot-product attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, the interaction between query $i$ (row $i$ of $Q$) and key $j$ (row $j$ of $K$) is captured by the dot product $q_i \cdot k_j$. Absolute positional encodings are typically added to the initial embeddings before they are projected into $Q$ and $K$. Relative positional encoding aims to modify this interaction based on the relative position $i - j$.
There are generally two main ways this relative information can be incorporated. One family adds learned embeddings for the offset $i - j$ to the key (and sometimes value) vectors before the dot product, so the interaction itself carries relative information. The other family adds a learned bias, indexed by the offset, directly to the attention logits before the softmax. Both variants are sketched below.
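As a rough sketch (exact formulations vary by method; $a^K_{i-j}$ and $b_{i-j}$ are placeholder symbols for learned parameters indexed by the offset, not notation from any single paper), the attention logit $e_{ij}$ between positions $i$ and $j$ might become:

$$e_{ij} = \frac{q_i \cdot \left(k_j + a^K_{i-j}\right)}{\sqrt{d_k}} \qquad \text{(relative embeddings added to keys)}$$

$$e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} + b_{i-j} \qquad \text{(relative bias added to the logits)}$$

The code example later in this section follows the second pattern, using one scalar bias per attention head.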
Put simply, absolute encoding defines each position relative to a fixed origin, while relative encoding focuses on the directed relationship between pairs of positions.
Why pursue this? Relative encodings offer several potential advantages: they can generalize better to sequence lengths not seen during training, since a (clipped) offset remains meaningful at new absolute positions; they make attention patterns depend on the distance between tokens rather than on where the pair sits in the sequence; and they align naturally with attention's pairwise structure, in which each score already involves exactly two positions.
A simplified way to think about modifying attention scores involves calculating the relative offset between every query position $i$ and key position $j$. This offset can then be used to look up a learned embedding representing that specific relative distance, as the PyTorch snippet below illustrates.
import torch
import torch.nn as nn
# --- Illustrative Snippet ---
# Assume B=batch_size, L=seq_len, H=num_heads, D=head_dim
# Assume query, key are shaped [B, H, L, D]
seq_len = 10
max_relative_distance = 4 # Only explicitly model nearby relative positions
# Query positions: 0, 1, ..., 9
query_pos = torch.arange(seq_len, dtype=torch.long)
# Key positions: 0, 1, ..., 9
key_pos = torch.arange(seq_len, dtype=torch.long)
# Calculate relative positions (matrix[i, j] = i - j)
# Shape: [L, L]
relative_pos = query_pos.unsqueeze(1) - key_pos.unsqueeze(0)
print(f"Raw relative positions (i-j) for L=5:\n{relative_pos[:5, :5]}")
# Clip relative positions to a range
# [-max_relative_distance, max_relative_distance]
# Map these clipped values to indices [0, 2*max_relative_distance]
clipped_relative_pos = torch.clamp(
    relative_pos,
    -max_relative_distance,
    max_relative_distance,
)
relative_pos_indices = clipped_relative_pos + max_relative_distance
print(f"\nClipped relative indices for L=5:\n{relative_pos_indices[:5, :5]}")
# Assume we have an embedding table for relative positions
num_relative_embeddings = 2 * max_relative_distance + 1
# e.g., for -4 to +4 -> 9 embeddings
# Embedding dimension options: H (one scalar bias per head)
# or D (a vector combined with keys/values)
# Here we use a scalar bias per head, so embedding dim = num_heads (H)
num_heads = 8
relative_embedding = nn.Embedding(num_relative_embeddings, num_heads) # Example H=8
# Look up the bias based on relative position indices
# Shape: [L, L, H]
relative_position_bias = relative_embedding(relative_pos_indices)
# Reshape/permute bias to match attention score shape
# [B, H, L, L]
# Add this bias to the QK^T dot product before softmax
# Example (ignoring batch dim B for simplicity):
# attn_scores shape: [H, L, L]
# relative_position_bias shape: [L, L, H] -> permute to [H, L, L]
bias_for_scores = relative_position_bias.permute(2, 0, 1) # Now [H, L, L]
# Attention calculation modification:
# query shape: [H, L, D], key shape: [H, L, D]
# attn_scores = torch.matmul(query, key.transpose(-2, -1)) / (D ** 0.5)  # [H, L, L]
# attn_scores = attn_scores + bias_for_scores # Add relative bias
# attn_probs = torch.softmax(attn_scores, dim=-1)
print(f"\nShape of lookup indices: {relative_pos_indices.shape}")
print(f"Shape of relative embedding table output: {relative_position_bias.shape}")
print(f"Shape of permuted bias for attention scores: {bias_for_scores.shape}")
This snippet illustrates the core process: calculate relative indices, clip them to manage embedding table size, look up corresponding embeddings, and prepare them to be added to attention scores.
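To make the commented-out attention modification concrete, here is a minimal sketch that plugs the bias into a full scaled dot-product attention. The function name simple_relative_attention, the head_dim value, and the random query, key, and value tensors are illustrative assumptions rather than part of any library; the sketch reuses num_heads, seq_len, and bias_for_scores from the snippet above.
def simple_relative_attention(query, key, value, bias):
    # query, key, value: [H, L, D]; bias: [H, L, L]
    d = query.size(-1)
    # Scaled dot-product scores: [H, L, L]
    attn_scores = torch.matmul(query, key.transpose(-2, -1)) / (d ** 0.5)
    # Inject the relative position bias before the softmax
    attn_scores = attn_scores + bias
    attn_probs = torch.softmax(attn_scores, dim=-1)
    # Weighted sum of values: [H, L, D]
    return torch.matmul(attn_probs, value)
head_dim = 16  # illustrative head dimension D
query = torch.randn(num_heads, seq_len, head_dim)
key = torch.randn(num_heads, seq_len, head_dim)
value = torch.randn(num_heads, seq_len, head_dim)
output = simple_relative_attention(query, key, value, bias_for_scores)
print(f"Attention output shape: {output.shape}")  # [H, L, D] = [8, 10, 16]
With a batch dimension, the same [H, L, L] bias would simply broadcast over attention scores of shape [B, H, L, L], so every example in the batch shares the same relative position bias.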
Of course, this is a simplified view. Efficient and effective implementations require careful handling of embedding lookups, potential sharing of embeddings across layers, and integrating this naturally into the Transformer block. The following sections will examine specific, well-established methods like Shaw et al.'s approach, Transformer-XL's relative encoding, and Rotary Position Embeddings (RoPE) that implement these concepts in practice.