Generating Long Sequences with Sparse Transformers, Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever, 2019. arXiv preprint arXiv:1904.10509. DOI: 10.48550/arXiv.1904.10509 - Introduces factorized sparse attention with fixed and strided patterns to reduce the quadratic complexity of self-attention.
Longformer: The Long-Document Transformer, Iz Beltagy, Matthew E. Peters, Arman Cohan, 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2004.05150 - Presents the Longformer model, which combines sliding-window and global attention for processing long documents.
Big Bird: Transformers for Longer Sequences, Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed, 2020. Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates, Inc.). DOI: 10.48550/arXiv.2007.14062 - Details the Big Bird model, an efficient Transformer variant whose sparse attention mechanism combines local, global, and random attention.
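The common thread in these papers is replacing the dense attention mask with a structured sparse one. As a minimal illustrative sketch (not taken from any of the cited papers' code), the snippet below builds a boolean mask combining the sliding-window, global, and random patterns discussed above; parameter names such as `window_size`, `global_tokens`, and `num_random` are assumptions made for this example.

```python
# Sketch: build a (seq_len, seq_len) boolean mask where True means
# "query position i may attend to key position j". Hypothetical parameters.
import numpy as np

def sparse_attention_mask(seq_len, window_size=4, global_tokens=(0,), num_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local (sliding-window) attention: each token attends to its neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window_size), min(seq_len, i + window_size + 1)
        mask[i, lo:hi] = True

    # Global attention: selected tokens attend everywhere and are attended by all.
    for g in global_tokens:
        mask[g, :] = True
        mask[:, g] = True

    # Random attention: each token additionally attends to a few random positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

if __name__ == "__main__":
    m = sparse_attention_mask(seq_len=16)
    print(f"mask density: {m.mean():.2f}")  # well below 1.0, unlike dense attention
```

In practice the papers implement these patterns with blocked or banded kernels rather than a materialized full mask, but the mask view makes the memory saving easy to see: the number of True entries grows roughly linearly with sequence length instead of quadratically.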