Self-Attention with Relative Position Representations, Peter Shaw, Jakob Uszkoreit, Ashish Vaswani, 2018. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (Association for Computational Linguistics). DOI: 10.18653/v1/N18-2074 - This paper introduces one of the earliest explicit formulations for incorporating relative positional information into the self-attention mechanism by adding learned relative position embeddings to the keys and values.
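For intuition, here is a minimal single-head, unbatched numpy sketch of that mechanism: relative-distance embeddings (clipped to a maximum offset) are looked up per query-key pair and added to the keys when scoring and to the values when aggregating. The names `a_k`, `a_v`, and `max_rel`, and the toy shapes, are illustrative assumptions rather than the paper's exact notation.

```python
import numpy as np

def relative_self_attention(x, Wq, Wk, Wv, a_k, a_v, max_rel=4):
    """Single-head self-attention with relative position embeddings added to
    keys and values, in the spirit of Shaw et al. (2018).
    x: (n, d_model); Wq/Wk/Wv: (d_model, d);
    a_k, a_v: (2*max_rel + 1, d) tables, one row per clipped relative distance."""
    n = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]

    # Clipped relative distance j - i, shifted to index the embedding tables.
    rel = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                  -max_rel, max_rel) + max_rel          # (n, n)

    # e_ij = q_i . (k_j + a^K_ij) / sqrt(d)
    scores = (q @ k.T + np.einsum('id,ijd->ij', q, a_k[rel])) / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)

    # z_i = sum_j alpha_ij (v_j + a^V_ij)
    return attn @ v + np.einsum('ij,ijd->id', attn, a_v[rel])
```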
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, Ruslan Salakhutdinov, 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics). DOI: 10.18653/v1/P19-1285 - This work presents an efficient relative positional encoding scheme that reformulates the attention score into separate content and relative position terms, enabling more effective modeling of longer sequences.
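The score decomposition can be illustrated with a dense (and deliberately inefficient) numpy sketch that materializes the projected relative encoding for every query-key offset; the paper's relative-shift trick, segment-level recurrence, and causal masking are omitted, and names like `Wk_r`, `u`, and `v_bias` follow the paper's notation only loosely.

```python
import numpy as np

def transformer_xl_scores(q, k, r, u, v_bias, Wk_r):
    """Attention logits decomposed as in Transformer-XL:
      (a) content-content      q_i . k_j
      (b) content-position     q_i . (Wk_r r_{i-j})
      (c) global content bias  u . k_j
      (d) global position bias v . (Wk_r r_{i-j})
    q, k: (n, d) projected queries/keys; r: (2n-1, d_model) sinusoidal relative
    encodings ordered from distance -(n-1) to n-1; u, v_bias: (d,)."""
    n, d = q.shape
    k_r = r @ Wk_r                                   # project relative encodings
    # idx[i, j] selects the encoding for distance i - j
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) + (n - 1)
    rel = k_r[idx]                                   # (n, n, d)

    ac = (q + u) @ k.T                               # terms (a) + (c)
    bd = np.einsum('id,ijd->ij', q + v_bias, rel)    # terms (b) + (d)
    return (ac + bd) / np.sqrt(d)
```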
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, 2020. JMLR, Vol. 21 (JMLR) - This paper describes the T5 model, which employs a simplified variant of relative positional encodings, showcasing their effectiveness in large-scale pre-training for various natural language processing tasks.
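Concretely, T5's simplification adds a learned scalar bias per (relative-position bucket, attention head) to the attention logits, with exact buckets for nearby offsets and log-spaced buckets for distant ones. The numpy sketch below approximates the bidirectional bucketing scheme; exact constants and edge handling in the released implementation may differ.

```python
import numpy as np

def t5_relative_bucket(rel_pos, num_buckets=32, max_distance=128):
    """Map signed relative positions (key_pos - query_pos), given as an int
    array, to bucket ids, roughly following T5's bidirectional scheme:
    exact buckets for small |distance|, log-spaced buckets beyond that."""
    num_buckets //= 2
    bucket = (rel_pos > 0).astype(np.int64) * num_buckets   # sign half
    rel = np.abs(rel_pos)
    max_exact = num_buckets // 2
    is_small = rel < max_exact
    # Log-spaced bucket for large distances (value unused when is_small).
    large = max_exact + (
        np.log(rel / max_exact + 1e-6) / np.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).astype(np.int64)
    large = np.minimum(large, num_buckets - 1)
    return bucket + np.where(is_small, rel, large)

# Usage: the bucket id indexes a learned (num_buckets, num_heads) table of
# scalars, which is added directly to the attention logits before softmax.
n = 6
rel = np.arange(n)[None, :] - np.arange(n)[:, None]   # key_pos - query_pos
buckets = t5_relative_bucket(rel)                      # (n, n) bucket ids
```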
DeBERTa: Decoding-enhanced BERT with Disentangled Attention, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, 2021. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2006.03654 - This research introduces a disentangled attention mechanism that refines relative positional encoding by representing content and relative position as separate vectors, leading to strong performance on many NLP benchmarks.
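A compact sketch of the disentangled score: content-to-content, content-to-position, and position-to-content terms are computed from separate content and relative-position projections and summed. The relative-distance indexing and sign conventions are simplified here, and the 1/sqrt(3d) scaling follows the paper's description of summing three terms.

```python
import numpy as np

def disentangled_scores(h, p, Wq_c, Wk_c, Wq_r, Wk_r, max_rel=4):
    """Disentangled attention logits in the spirit of DeBERTa.
    h: (n, d) content states; p: (2*max_rel + 1, d) relative position
    embeddings (row index = clipped distance + max_rel); W*: (d, d)."""
    n, d = h.shape
    qc, kc = h @ Wq_c, h @ Wk_c          # content projections
    qr, kr = p @ Wq_r, p @ Wk_r          # relative position projections

    # delta[i, j] indexes the relative tables for clipped distance i - j
    delta = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :],
                    -max_rel, max_rel) + max_rel

    c2c = qc @ kc.T                                   # content -> content
    c2p = np.einsum('id,ijd->ij', qc, kr[delta])      # content -> position
    p2c = np.einsum('jd,ijd->ij', kc, qr[delta.T])    # position -> content

    # Scale by 1/sqrt(3d) since three score terms are summed.
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```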