While absolute positional encodings, as discussed in Chapter 4, provide a mechanism for Transformers to understand sequence order, they treat each position independently. The sinusoidal encoding offers nice properties for modeling relative distances implicitly, but it's added before the attention mechanism operates. Learned absolute embeddings might struggle to generalize to sequence lengths longer than those seen during training. An alternative approach directly incorporates the relative distance between tokens into the attention calculation itself. This is the core idea behind Relative Positional Encodings (RPE).
The intuition is that the relationship between two words often depends more on how far apart they are rather than their absolute positions in the sequence. For instance, knowing that a verb follows its subject by one position might be a more generalizable pattern than knowing the subject is at position 5 and the verb is at position 6. RPE aims to make the model directly aware of these relative distances.
Instead of adding positional information to the input embeddings, RPE modifies the self-attention scoring mechanism. The standard scaled dot-product attention calculates scores between a query $q_i$ for position $i$ and a key $k_j$ for position $j$ as:
$$\text{score}(q_i, k_j) = \frac{q_i^T k_j}{\sqrt{d_k}}$$

where $q_i = x_i W_Q$ and $k_j = x_j W_K$, with $x_i, x_j$ being input embeddings and $W_Q, W_K$ being projection matrices.
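To make this baseline concrete, here is a minimal single-head, unbatched sketch of the score computation in PyTorch (the function name and tensor shapes are illustrative, not taken from any particular library):

```python
import torch

def attention_scores(x, W_Q, W_K):
    """Standard scaled dot-product attention scores (single head, no batch).

    x:   (seq_len, d_model) input embeddings
    W_Q: (d_model, d_k) query projection
    W_K: (d_model, d_k) key projection
    Returns the (seq_len, seq_len) matrix of scores score(q_i, k_j).
    """
    q = x @ W_Q                        # q_i = x_i W_Q
    k = x @ W_K                        # k_j = x_j W_K
    d_k = q.size(-1)
    return (q @ k.T) / d_k ** 0.5      # q_i^T k_j / sqrt(d_k)
```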
Relative positioning schemes inject information about the relationship between $i$ and $j$ directly into this calculation. Several variations exist, but they generally involve adding terms that depend on the relative distance $i - j$.
One of the earlier influential approaches proposed adding learned relative position embeddings directly to the keys (and sometimes values) before the dot product. Let $a_{ij}^K$ and $a_{ij}^V$ represent learnable embedding vectors corresponding to the relative position between query $i$ and key/value $j$. The attention score calculation is modified as:
$$e_{ij} = \frac{(x_i W_Q)^T (x_j W_K + a_{ij}^K)}{\sqrt{d_k}}$$

The output value $z_i$ is then computed using a similar modification for the value vectors:
$$z_i = \sum_j \text{softmax}(e_{ij}) \, (x_j W_V + a_{ij}^V)$$

Here, $a_{ij}^K$ and $a_{ij}^V$ are typically retrieved from embedding lookup tables indexed by the relative distance $j - i$. To keep the number of embeddings manageable, the relative distance is often clipped to a maximum value $k$. That is, all distances $j - i > k$ map to the same embedding $a_{i, i+k}^K$, and distances $j - i < -k$ map to $a_{i, i-k}^K$.
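As a concrete illustration, the following is a minimal sketch of this scheme in PyTorch for a single head with no batch dimension. The function and argument names (`relative_attention`, `rel_emb_k`, `rel_emb_v`, `max_dist`) are hypothetical, chosen here only to mirror the notation above:

```python
import torch
import torch.nn.functional as F

def relative_attention(x, W_Q, W_K, W_V, rel_emb_k, rel_emb_v, max_dist):
    """Attention with learned relative position embeddings on keys and values.

    x:           (seq_len, d_model) input embeddings
    W_Q/W_K/W_V: (d_model, d_k) projection matrices
    rel_emb_k:   (2*max_dist + 1, d_k) table of a^K vectors, indexed by clipped j - i
    rel_emb_v:   (2*max_dist + 1, d_k) table of a^V vectors
    max_dist:    clipping distance k
    """
    seq_len, d_k = x.size(0), W_Q.size(1)
    q, k, v = x @ W_Q, x @ W_K, x @ W_V            # (seq_len, d_k) each

    # Clipped relative distances j - i, shifted into valid table indices [0, 2k].
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
    a_k = rel_emb_k[rel]                            # (seq_len, seq_len, d_k)
    a_v = rel_emb_v[rel]                            # (seq_len, seq_len, d_k)

    # e_ij = (x_i W_Q)^T (x_j W_K + a_ij^K) / sqrt(d_k)
    scores = (q @ k.T + torch.einsum('id,ijd->ij', q, a_k)) / d_k ** 0.5
    attn = F.softmax(scores, dim=-1)

    # z_i = sum_j softmax(e_ij) (x_j W_V + a_ij^V)
    return attn @ v + torch.einsum('ij,ijd->id', attn, a_v)
```

Note that `a_k` and `a_v` are full `(seq_len, seq_len, d_k)` tensors, which is exactly the overhead discussed next.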
This approach directly injects relative positional biases into the attention score. However, it requires materializing and using a relative embedding for every query-key pair in the attention matrix, a memory and compute cost that grows quadratically with sequence length.
A more efficient and widely adopted approach, introduced with Transformer-XL, reformulates the attention calculation to elegantly incorporate relative positions. Recall the standard attention score involving absolute positional embeddings $P_i, P_j$:
$$A_{i,j}^{\text{abs}} = (E_{x_i} + P_i)^T W_Q^T W_K (E_{x_j} + P_j)$$

Expanding this product gives four terms: content-content ($E_{x_i}^T \dots E_{x_j}$), content-position ($E_{x_i}^T \dots P_j$), position-content ($P_i^T \dots E_{x_j}$), and position-position ($P_i^T \dots P_j$).
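Written out fully (the ellipses above all stand for $W_Q^T W_K$), the expansion is:

$$A_{i,j}^{\text{abs}} = \underbrace{E_{x_i}^T W_Q^T W_K E_{x_j}}_{\text{content-content}} + \underbrace{E_{x_i}^T W_Q^T W_K P_j}_{\text{content-position}} + \underbrace{P_i^T W_Q^T W_K E_{x_j}}_{\text{position-content}} + \underbrace{P_i^T W_Q^T W_K P_j}_{\text{position-position}}$$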
The relative formulation modifies this expansion in two ways: the absolute key-side embeddings $P_j$ are replaced by relative position encodings $R_{i-j}$ (sinusoidal in Transformer-XL) that depend only on the offset between the two positions, and the query-side position terms are replaced by learnable vectors $u$ and $v$ that do not depend on $i$. This leads to the following decomposed attention score calculation:
$$A_{i,j}^{\text{rel}} = \underbrace{E_{x_i}^T W_Q^T W_K E_{x_j}}_{\text{(a) content-based}} + \underbrace{E_{x_i}^T W_Q^T W_K R_{i-j}}_{\text{(b) content-relative position}} + \underbrace{u^T W_K E_{x_j}}_{\text{(c) global content bias}} + \underbrace{v^T W_K R_{i-j}}_{\text{(d) global positional bias}}$$

Crucially, this formulation can be implemented efficiently. The terms involving $R_{i-j}$ can be computed without explicitly constructing pairwise relative embeddings for all $(i, j)$ pairs. Instead, because $R_{i-j}$ depends only on the offset $i - j$, terms (b) and (d) can be computed once per offset and then aligned to the correct $(i, j)$ entries with a simple shift, covering all positions simultaneously.
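One common way to realize this is the "relative shift" used in the Transformer-XL reference implementation. The sketch below (PyTorch, single head, no batch dimension, assuming square causal attention; function and argument names are illustrative) computes term (b) with one dot product per (query, offset) pair rather than per $(i, j)$ pair. Term (d) works the same way, with the learned vector $v$ taking the place of each query row.

```python
import torch

def rel_shift(x):
    """Relative-shift trick, square causal case.

    x: (L, L) tensor whose column m holds scores against the relative
       encoding for offset (L - 1 - m). After the shift, entry (i, j)
       holds the score for offset (i - j); entries with j > i are
       meaningless but are removed by the causal mask anyway.
    """
    L = x.size(0)
    zero_pad = x.new_zeros(L, 1)
    x_padded = torch.cat([zero_pad, x], dim=1)   # (L, L + 1)
    x_padded = x_padded.view(L + 1, L)           # reinterpret the flat buffer
    return x_padded[1:]                          # drop first row -> (L, L)

def content_relative_term(q, r_proj):
    """Term (b) for all (i, j) without a pairwise embedding tensor.

    q:      (L, d_k) projected queries (E_x W_Q)
    r_proj: (L, d_k) projected relative encodings W_K R_m, ordered so that
            row m corresponds to offset (L - 1 - m)
    """
    bd = q @ r_proj.T          # (L, L): one dot product per (query, offset)
    return rel_shift(bd)
```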
In summary, relative positional encodings offer a compelling alternative to absolute positional encodings by embedding sequence order information directly into the attention mechanism's score calculation. By focusing on pairwise distances rather than absolute locations, they can provide better generalization and capture distance-sensitive relationships more effectively, proving valuable in various modern Transformer architectures.