While absolute positional encodings, such as sinusoidal or learned embeddings, provide Transformers with necessary sequence order information, they treat each position independently. This can limit their ability to generalize to sequence lengths not seen during training and doesn't explicitly model the relationship between tokens based on their distance apart. The approach proposed by Shaw et al. (2018) in "Self-Attention with Relative Position Representations" offers an alternative by directly incorporating relative distances into the attention mechanism itself.
The core idea is to modify the calculation of attention scores and potentially the aggregation of value vectors, making them sensitive to the relative offset between interacting tokens. Instead of adding positional information only to the initial embeddings, this method introduces learnable embeddings that represent different relative distances.
Recall the standard scaled dot-product attention score between a query vector for position $i$ ($q_i = x_i W^Q$) and a key vector for position $j$ ($k_j = x_j W^K$):
$$\text{score}(q_i, k_j) = \frac{q_i k_j^T}{\sqrt{d_k}}$$

Shaw et al. introduce learnable relative position embeddings for keys, denoted $a^K$. These embeddings capture the relationship from position $i$ to position $j$: specifically, $a_{ij}^K$ represents the embedding for the relative distance $j - i$. The attention score calculation is then modified to include a term that incorporates this relative position information:
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T + (x_i W^Q)(a_{ij}^K)^T}{\sqrt{d_k}}$$

Let's break down this modified score $e_{ij}$: the first term, $(x_i W^Q)(x_j W^K)^T$, is the familiar content-based similarity between query and key, while the second term, $(x_i W^Q)(a_{ij}^K)^T$, measures how well the same query aligns with a learned embedding of the relative offset $j - i$, injecting positional information directly into the score. The final attention weights $\alpha_{ij}$ are obtained by applying the softmax function to these modified scores $e_{ij}$.
The attention score calculation in Shaw et al.'s method incorporates both content similarity and a bias derived from the relative positional embedding for the key.
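Written with the scaling made explicit, the modified score separates cleanly into the two pieces described above (this is just an algebraic rearrangement of the equation for $e_{ij}$):

$$e_{ij} = \underbrace{\frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_k}}}_{\text{content term}} + \underbrace{\frac{(x_i W^Q)(a_{ij}^K)^T}{\sqrt{d_k}}}_{\text{relative position bias}}$$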
Calculating and storing unique embeddings for every possible relative distance in a very long sequence would be inefficient and likely unnecessary. The relationship between tokens that are very far apart might be less informative or follow a general pattern. Therefore, it's common practice to clip the maximum relative distance considered.
A maximum distance k is chosen. Any relative distance j−i where ∣j−i∣>k is clipped to −k or k. For example, if k=8, the relative distance j−i=10 would be treated as 8, and j−i=−12 would be treated as −8. This means the model only needs to learn embeddings for relative distances in the range [−k,k], resulting in 2k+1 unique relative position embeddings.
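As a quick illustration of the clipping rule, here is a minimal sketch using torch.clamp with the same $k = 8$ example as above (the variable names are illustrative only):
import torch
max_dist = 8  # maximum relative distance k
offsets = torch.tensor([10, -12, 3])               # example relative distances j - i
clipped = torch.clamp(offsets, -max_dist, max_dist)  # tensor([ 8, -8,  3])
indices = clipped + max_dist                        # shift into [0, 2k] for an embedding lookup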
In a practical PyTorch implementation, you would typically:
Define the Relative Position Embedding Layer: Create an nn.Embedding layer to store the learnable embeddings $a^K$. Its size would be (2 * max_relative_position + 1, head_dim).
import torch
import torch.nn as nn
max_relative_position = 8 # Example maximum distance
head_dim = 64 # Example dimension per attention head
# We need embeddings for distances from -k to +k, so 2k+1 total.
num_relative_embeddings = 2 * max_relative_position + 1
relative_key_embeddings = nn.Embedding(num_relative_embeddings, head_dim)
Calculate Relative Positions: For a given sequence length seq_len, compute the matrix of relative positions between all query (i) and key (j) positions.
seq_len = 512 # Example sequence length
range_vec = torch.arange(seq_len)
relative_pos_matrix = range_vec[None, :] - range_vec[:, None] # Shape: [seq_len, seq_len]
Clip and Map Positions to Indices: Clip the relative positions and shift them to be non-negative indices suitable for the embedding lookup.
clipped_relative_pos = torch.clamp(relative_pos_matrix,
-max_relative_position,
max_relative_position)
# Shift indices to be 0 to 2k
embedding_indices = clipped_relative_pos + max_relative_position
Lookup Embeddings: Retrieve the corresponding $a_{ij}^K$ embeddings.
# Shape: [seq_len, seq_len, head_dim]
rel_key_embeds = relative_key_embeddings(embedding_indices)
Calculate Relative Attention Term: Compute the $(x_i W^Q)(a_{ij}^K)^T$ term. This requires careful tensor manipulation to perform the dot product between each query $q_i$ and the corresponding relative key embeddings $a_{ij}^K$ for all $j$.
# Assume queries 'q' with shape [batch_size, num_heads, seq_len, head_dim]
batch_size, num_heads = 2, 8  # example values
q = torch.randn(batch_size, num_heads, seq_len, head_dim)

# rel_key_embeds[i, j] holds a_ij^K for the clipped offset j - i, so we need
# relative_logits[b, h, i, j] = q[b, h, i] . rel_key_embeds[i, j].
# einsum expresses this directly:
relative_logits = torch.einsum('bhid,ijd->bhij', q, rel_key_embeds)
# Shape: [batch_size, num_heads, seq_len, seq_len]

# An equivalent batched-matmul formulation, similar in spirit to library
# implementations such as Tensor2Tensor: move the query position into the
# leading batch dimension and multiply per position.
q_t = q.permute(2, 0, 1, 3).reshape(seq_len, batch_size * num_heads, head_dim)
# [seq_len, b*h, head_dim] @ [seq_len, head_dim, seq_len] -> [seq_len, b*h, seq_len]
rel_logits_bmm = torch.matmul(q_t, rel_key_embeds.transpose(1, 2))
rel_logits_bmm = rel_logits_bmm.reshape(seq_len, batch_size, num_heads, seq_len)
relative_logits_alt = rel_logits_bmm.permute(1, 2, 0, 3)  # [batch, heads, seq_len, seq_len]
Note: The direct formulation above materializes the full [seq_len, seq_len, head_dim] tensor of relative key embeddings, so its memory cost grows quadratically with sequence length; library implementations often use additional reshaping or 'skewing' strategies to reduce this overhead. However it is computed, the goal is the dot product between each query $q_i$ and all relevant $a_{ij}^K$ vectors.
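As a quick sanity check on the sketch above, the two formulations should agree up to floating-point error (relative_logits and relative_logits_alt are the illustrative names introduced in the previous block):
# Both routes compute the same relative attention term.
assert torch.allclose(relative_logits, relative_logits_alt, atol=1e-5)
print(relative_logits.shape)  # torch.Size([2, 8, 512, 512])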
Combine Scores: Add the relative_logits to the content_logits before scaling and applying softmax.
import torch.nn.functional as F

# Assume keys 'k' with shape [batch_size, num_heads, seq_len, head_dim]
k = torch.randn(batch_size, num_heads, seq_len, head_dim)

content_logits = torch.matmul(q, k.transpose(-2, -1))  # standard content attention
combined_logits = content_logits + relative_logits
attention_scores = combined_logits / (head_dim ** 0.5)
attention_weights = F.softmax(attention_scores, dim=-1)
Shaw et al. also proposed adding a similar relative position embedding term, $a_{ij}^V$, when aggregating the value vectors $v_j = x_j W^V$. The output $z_i$ for position $i$ would become:
$$z_i = \sum_j \alpha_{ij} \left( x_j W^V + a_{ij}^V \right)$$

This modification is less frequently highlighted than the key embedding modification but follows the same principle: making the output sensitive to the relative positions of the attended tokens. It requires a separate embedding layer, relative_value_embeddings.
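A minimal sketch of this value-side term, reusing the tensors from the walkthrough above (attention_weights, embedding_indices, num_relative_embeddings) and introducing a hypothetical relative_value_embeddings layer that mirrors the key version:
# Hypothetical value-side embedding table, mirroring the key version.
relative_value_embeddings = nn.Embedding(num_relative_embeddings, head_dim)
rel_value_embeds = relative_value_embeddings(embedding_indices)  # [seq_len, seq_len, head_dim]

# Assume values 'v' with shape [batch_size, num_heads, seq_len, head_dim]
v = torch.randn(batch_size, num_heads, seq_len, head_dim)

# Standard weighted sum over values: sum_j alpha_ij * v_j
z = torch.matmul(attention_weights, v)
# Relative correction: sum_j alpha_ij * a_ij^V
z = z + torch.einsum('bhij,ijd->bhid', attention_weights, rel_value_embeds)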
Advantages:
- Attention scores depend explicitly on the relative offset between tokens rather than only on absolute positions added to the input embeddings.
- Because distances beyond $k$ are clipped to shared embeddings, the model can be applied to sequence lengths not seen during training.
Disadvantages:
- The relative embedding lookup produces a [seq_len, seq_len, head_dim] tensor, so memory and compute overhead grow quadratically with sequence length.
- The attention computation is more involved to implement efficiently than standard absolute-position attention.
This method represents an important step towards incorporating relative positional awareness directly into the Transformer's attention mechanism. While effective, it's one of several approaches, and subsequent chapters will discuss alternatives like the relative encoding scheme used in Transformer-XL and Rotary Position Embedding (RoPE), which address similar goals through different mechanisms.