Longformer: The Long-Document Transformer, Iz Beltagy, Matthew E. Peters, Arman Cohan, 2020, arXiv preprint arXiv:2004.05150, DOI: 10.48550/arXiv.2004.05150 - Introduces an attention mechanism that scales linearly with sequence length by combining local sliding-window attention with task-motivated global attention.
Big Bird: Transformers for Longer Sequences, Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed, 2020, Neural Information Processing Systems (NeurIPS), DOI: 10.48550/arXiv.2007.14062 - Presents a sparse attention mechanism combining random, windowed, and global attention that achieves linear complexity and can approximate full attention.
Rethinking Attention with Performers, Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller, 2020, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.2009.14794 - Proposes the Performer model, which uses positive orthogonal random features to approximate softmax attention with linear complexity.
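To make the idea behind the Performer entry above concrete, here is a minimal NumPy sketch of approximating softmax attention with positive random features and exploiting matrix associativity for linear complexity. It draws i.i.d. Gaussian features rather than the orthogonal features the paper uses, and the function names (`positive_random_features`, `performer_attention`) are illustrative, not the authors' released implementation.

```python
import numpy as np

def positive_random_features(x, projection, eps=1e-6):
    """Map inputs to positive features phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m),
    so that E[phi(q) . phi(k)] approximates the softmax kernel exp(q . k)."""
    m = projection.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ projection.T - sq_norm) / np.sqrt(m) + eps

def performer_attention(q, k, v, num_features=256, seed=0):
    """Linear-complexity approximation of softmax attention.
    q, k, v: arrays of shape (seq_len, d)."""
    d = q.shape[-1]
    rng = np.random.default_rng(seed)
    # i.i.d. Gaussian projections; the paper additionally orthogonalizes
    # these rows, which is omitted here for brevity.
    projection = rng.standard_normal((num_features, d))
    # Scale queries/keys by d^{-1/4} so the approximated kernel matches
    # exp(q . k / sqrt(d)) from standard softmax attention.
    q_prime = positive_random_features(q / d ** 0.25, projection)  # (n, m)
    k_prime = positive_random_features(k / d ** 0.25, projection)  # (n, m)
    # Associativity: (Q' K'^T) V == Q' (K'^T V), computed in O(n * m * d)
    # instead of the O(n^2 * d) cost of materializing the attention matrix.
    kv = k_prime.T @ v                      # (m, d)
    normalizer = q_prime @ k_prime.sum(0)   # (n,) row sums of the kernel matrix
    return (q_prime @ kv) / normalizer[:, None]

if __name__ == "__main__":
    n, d = 1024, 64
    rng = np.random.default_rng(1)
    q, k, v = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
    out = performer_attention(q, k, v)
    print(out.shape)  # (1024, 64)
```

The key design point is that the feature map is applied to queries and keys independently, so the key-value summary `k_prime.T @ v` can be computed once and reused for every query, which is what removes the quadratic dependence on sequence length.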