Absolute positional encodings provide a mechanism for Transformers to understand sequence order, but they treat each position independently. The sinusoidal encoding has properties that allow relative distances to be modeled implicitly (a fixed offset corresponds to a linear transformation of the position vectors), but it is added to the embeddings before the attention mechanism operates. Learned absolute embeddings might struggle to generalize to sequence lengths longer than those seen during training. An alternative approach directly incorporates the relative distance between tokens into the attention calculation itself. This is the core idea behind Relative Positional Encodings (RPE).
The intuition is that the relationship between two words often depends more on how far apart they are rather than their absolute positions in the sequence. For instance, knowing that a verb follows its subject by one position might be a more generalizable pattern than knowing the subject is at position 5 and the verb is at position 6. RPE aims to make the model directly aware of these relative distances.
Modifying the Attention Score
Instead of adding positional information to the input embeddings, RPE modifies the self-attention scoring mechanism. The standard scaled dot-product attention calculates scores between a query $q_i$ for position $i$ and a key $k_j$ for position $j$ as:
$$\text{score}(q_i, k_j) = \frac{q_i^T k_j}{\sqrt{d_k}}$$
where $q_i = x_i W_Q$ and $k_j = x_j W_K$, with $x_i, x_j$ being input embeddings and $W_Q, W_K$ being projection matrices.
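To make this baseline concrete, here is a minimal single-head sketch in PyTorch that computes the full matrix of scores $q_i^T k_j / \sqrt{d_k}$. The sizes (`seq_len`, `d_model`, `d_k`) and the random inputs are illustrative placeholders, not values from any particular model.

```python
import torch

# Illustrative sizes; not tied to any particular model.
seq_len, d_model, d_k = 6, 32, 16

x = torch.randn(seq_len, d_model)      # input embeddings x_1 .. x_n
W_Q = torch.randn(d_model, d_k)        # query projection W_Q
W_K = torch.randn(d_model, d_k)        # key projection W_K

q = x @ W_Q                            # queries q_i = x_i W_Q, shape (seq_len, d_k)
k = x @ W_K                            # keys    k_j = x_j W_K, shape (seq_len, d_k)

scores = q @ k.T / d_k ** 0.5          # score(q_i, k_j) for all (i, j), shape (seq_len, seq_len)
```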
Relative positioning schemes inject information about the relationship between i and j directly into this calculation. Several variations exist, but they generally involve adding terms that depend on the relative distance i−j.
Shaw et al. (2018) Formulation
One of the earlier influential approaches proposed adding learned relative position embeddings directly to the keys (and sometimes values) before the dot product. Let $a_{ij}^K$ and $a_{ij}^V$ represent learnable embedding vectors corresponding to the relative position between query $i$ and key/value $j$. The attention score calculation is modified as:
$$e_{ij} = \frac{(x_i W_Q)^T (x_j W_K + a_{ij}^K)}{\sqrt{d_k}}$$
The output value $z_i$ is then computed using a similar modification for the value vectors:
$$z_i = \sum_j \operatorname{softmax}(e_{ij}) \, (x_j W_V + a_{ij}^V)$$
Here, $a_{ij}^K$ and $a_{ij}^V$ are typically retrieved from embedding lookup tables indexed by the relative distance $j-i$. To keep the number of embeddings manageable, the relative distance is often clipped to a maximum value $k$: all distances $j-i > k$ map to the same embedding $a_{i,i+k}^K$, and all distances $j-i < -k$ map to $a_{i,i-k}^K$.
This approach directly injects relative spatial biases into the attention score. However, it requires computing and storing a relative embedding for every query-key pair, adding a tensor of shape (sequence length, sequence length, head dimension) to the attention computation, which can be memory- and compute-intensive for long sequences.
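The following is a minimal single-head PyTorch sketch of this formulation under toy assumptions (random weights, illustrative sizes, an assumed clipping distance `max_dist`). It materializes the per-pair embedding tensors explicitly, which is exactly the overhead noted above.

```python
import torch
import torch.nn.functional as F

seq_len, d_model, d_k = 6, 32, 16
max_dist = 4                                        # clipping distance k (assumed value)

x = torch.randn(seq_len, d_model)
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

# One learnable vector per clipped relative distance in [-max_dist, max_dist]
rel_K = torch.randn(2 * max_dist + 1, d_k)          # table behind a_ij^K
rel_V = torch.randn(2 * max_dist + 1, d_k)          # table behind a_ij^V

# Look up a_ij^K and a_ij^V by the clipped distance j - i for every pair (i, j)
pos = torch.arange(seq_len)
idx = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
a_K = rel_K[idx]                                    # (seq_len, seq_len, d_k)
a_V = rel_V[idx]                                    # (seq_len, seq_len, d_k)

q, k, v = x @ W_Q, x @ W_K, x @ W_V

# e_ij = (x_i W_Q)^T (x_j W_K + a_ij^K) / sqrt(d_k)
e = (q @ k.T + torch.einsum('id,ijd->ij', q, a_K)) / d_k ** 0.5
alpha = F.softmax(e, dim=-1)

# z_i = sum_j alpha_ij (x_j W_V + a_ij^V)
z = alpha @ v + torch.einsum('ij,ijd->id', alpha, a_V)
```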
Transformer-XL / Dai et al. (2019) Formulation
A more efficient and widely adopted approach, introduced with Transformer-XL, reformulates the attention calculation to elegantly incorporate relative positions. Recall the standard attention score involving absolute positional embeddings $P_i, P_j$:
$$A_{i,j}^{\text{abs}} = (E_{x_i} + P_i)^T W_Q^T W_K (E_{x_j} + P_j)$$
Expanding this gives four terms: content-content ($E_{x_i}^T \ldots E_{x_j}$), content-position ($E_{x_i}^T \ldots P_j$), position-content ($P_i^T \ldots E_{x_j}$), and position-position ($P_i^T \ldots P_j$).
The relative formulation modifies this expansion:
Replace the absolute position $P_j$ in the key projection: in terms involving the key's position, replace the absolute position $P_j$ with a relative position encoding $R_{i-j}$ that represents the offset between $i$ and $j$. $R$ can be a fixed sinusoidal encoding matrix (similar to the original Transformer's PE, but used differently) or learned embeddings.
Introduce trainable position biases: replace the query's absolute position term $P_i^T W_Q^T$ with two trainable vectors, $u$ and $v$. These vectors represent global "positional biases" for content and relative position, respectively.
This leads to the following decomposed attention score calculation:
$$A_{i,j}^{\text{rel}} = \underbrace{E_{x_i}^T W_Q^T W_K E_{x_j}}_{\text{(a) content-based}} + \underbrace{E_{x_i}^T W_Q^T W_K R_{i-j}}_{\text{(b) content-relative position}} + \underbrace{u^T W_K E_{x_j}}_{\text{(c) global content bias}} + \underbrace{v^T W_K R_{i-j}}_{\text{(d) global positional bias}}$$
Term (a) is identical to the content interaction in standard attention.
Term (b) captures how the query content at position $i$ relates to the relative position $i-j$.
Term (c) provides a bias based purely on the key content at position $j$.
Term (d) provides a bias based purely on the relative position $i-j$.
Crucially, this formulation can be implemented efficiently. The terms involving $R_{i-j}$ can be computed without explicitly constructing pairwise relative embeddings for all $(i, j)$ pairs: the relative encodings are projected once, and a "relative shift" of the resulting score matrix aligns terms (b) and (d) with the correct offsets for all positions simultaneously.
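As a rough illustration, the sketch below computes terms (a)-(d) for a single head with toy, randomly initialized tensors. It groups (a)+(c) and (b)+(d) as in the Transformer-XL paper, projects every relative encoding once, and uses an explicit gather where the original implementation applies its cheaper relative-shift trick. A single $W_K$ is shared between content and relative terms to match the equation above; Transformer-XL itself uses separate key projections for content and position.

```python
import torch

seq_len, d_model, d_k = 6, 32, 16

E = torch.randn(seq_len, d_model)        # content embeddings E_x
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)          # shared for content and R, matching the equation above
u = torch.randn(d_k)                     # global content bias
v = torch.randn(d_k)                     # global positional bias

# One encoding R_m per possible offset m = i - j in [-(seq_len-1), seq_len-1];
# row m + seq_len - 1 corresponds to offset m (sinusoidal or learned in practice).
R = torch.randn(2 * seq_len - 1, d_model)

q = E @ W_Q                              # (seq_len, d_k)
k = E @ W_K                              # (seq_len, d_k)
r = R @ W_K                              # project every relative encoding once

ac = (q + u) @ k.T                       # terms (a) + (c): content plus global content bias
bd_all = (q + v) @ r.T                   # terms (b) + (d) for every offset, (seq_len, 2*seq_len-1)

# Pick out the column for offset i - j at each (i, j). Transformer-XL achieves the same
# alignment with a "relative shift" of bd_all instead of an explicit gather.
pos = torch.arange(seq_len)
offset_idx = (pos[:, None] - pos[None, :]) + (seq_len - 1)
bd = bd_all.gather(1, offset_idx)

A_rel = (ac + bd) / d_k ** 0.5           # (seq_len, seq_len) relative attention scores
```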
Implementation Aspects
Clipping Distance: As with the Shaw et al. approach, a maximum relative distance $k$ is often used when indexing relative position embeddings ($R_{i-j}$). This assumes that very long-range interactions might not need precise distance information (a clipped lookup appears in the sketch after this list).
Embedding Type: The relative position representation $R$ can be based on sinusoidal functions (providing generalization capabilities) or learned embeddings (potentially more expressive but requiring more data and parameters).
Sharing: Relative position embeddings (whether sinusoidal or learned) are often shared across different attention heads and sometimes across layers to reduce the parameter count.
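As a small illustration of clipping and sharing, the sketch below defines a hypothetical module holding one clipped relative-position table that produces a per-head scalar bias added to the attention logits, in the spirit of T5's simplified variant. The class name, sizes, and the scalar-bias design are assumptions for illustration rather than any library's API.

```python
import torch
import torch.nn as nn

class SharedRelativeBias(nn.Module):
    """Hypothetical helper: one clipped relative-position table shared by all heads,
    producing a per-head scalar bias added to the attention logits."""

    def __init__(self, num_heads: int, max_dist: int):
        super().__init__()
        self.max_dist = max_dist
        # One row per clipped distance in [-max_dist, max_dist], one scalar per head
        self.table = nn.Embedding(2 * max_dist + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
        idx = rel + self.max_dist                     # shift distances into 0 .. 2k
        return self.table(idx).permute(2, 0, 1)       # (num_heads, seq_len, seq_len)

# The same module instance can be reused by every layer to share parameters across layers.
bias = SharedRelativeBias(num_heads=8, max_dist=16)(seq_len=128)
```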
Advantages of Relative Positional Encoding
Improved Generalization: RPEs, especially sinusoidal ones or those with clipping, may generalize better than learned absolute PEs to sequence lengths not seen during training. The model learns patterns based on distance rather than specific locations.
Direct Distance Modeling: The attention mechanism becomes directly aware of the relative positioning between tokens, which can be beneficial for tasks where local syntax or relative order is important.
Empirical Success: RPEs form a component of several high-performing models, including Transformer-XL, T5, and DeBERTa, demonstrating their practical effectiveness.
Disadvantages of Relative Positional Encoding
Increased Complexity: The attention calculation becomes more involved compared to the baseline Transformer, although efficient implementations like the Transformer-XL formulation mitigate the computational overhead significantly compared to naive approaches.
Hyperparameters: Introduces choices like the clipping distance k and the type of relative encoding (sinusoidal vs. learned), which may require tuning.
In summary, relative positional encodings offer a compelling alternative to absolute positional encodings by embedding sequence order information directly into the attention mechanism's score calculation. By focusing on pairwise distances rather than absolute locations, they can provide better generalization and capture distance-sensitive relationships more effectively, proving valuable in various modern Transformer architectures.
Self-Attention with Relative Position Representations. Peter Shaw, Jakob Uszkoreit, Ashish Vaswani (2018). Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics. DOI: 10.18653/v1/N18-2074 - This paper introduces one of the earliest explicit formulations for incorporating relative positional information into the self-attention mechanism by adding learned relative position embeddings to keys and values.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, Ruslan Salakhutdinov (2019). Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics. DOI: 10.18653/v1/P19-1285 - This work presents an efficient relative positional encoding scheme that reformulates the attention score calculation to decompose it into content and relative position terms, which helps with better handling of longer sequences.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu (2020). Journal of Machine Learning Research, Vol. 21. - This paper describes the T5 model, which employs a simplified variant of relative positional encodings, showcasing their effectiveness in large-scale pre-training for various natural language processing tasks.
DeBERTa: Decoding-enhanced BERT with Disentangled Attention. Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen (2021). International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2006.03654 - This research introduces a disentangled attention mechanism that refines relative positional encoding by treating content and relative position embeddings as separate vectors, leading to strong performance in many NLP benchmarks.