Rethinking Attention with Performers, Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller, 2021. International Conference on Learning Representations (ICLR 2021). DOI: 10.48550/arXiv.2009.14794 - The original research paper introducing the Performer architecture and the FAVOR+ mechanism for linear-time approximation of softmax attention.
Random Features for Large-Scale Kernel Machines, Ali Rahimi, Benjamin Recht, 2007. Advances in Neural Information Processing Systems 20 (NIPS 2007) - A foundational paper that introduced random Fourier features for approximating shift-invariant kernels, a direct inspiration for the Performer's kernel approximation of attention.
Efficient Transformers: A Survey, Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler, 2022. ACM Computing Surveys, Vol. 55 (ACM). DOI: 10.1145/3530811 - Provides a comprehensive overview of techniques for making Transformer models more efficient, including linear attention mechanisms such as the Performer.
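To make the connection between the first two references concrete, below is a minimal, illustrative sketch (not the reference implementation from the Performer paper) of random-feature attention in NumPy: softmax attention is approximated with positive random feature maps phi(q) and phi(k), so the L x L attention matrix is never formed and the cost becomes linear in sequence length. The function names, the feature count m, and the Gaussian projection matrix are illustrative choices, not names from the papers.

```python
import numpy as np

def positive_random_features(x, w):
    """phi(x) = exp(x @ w.T - ||x||^2 / 2) / sqrt(m): positive features whose
    inner product is an unbiased estimate of the softmax kernel exp(q . k)."""
    m = w.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)   # (L, 1)
    return np.exp(x @ w.T - sq_norm) / np.sqrt(m)             # (L, m)

def random_feature_attention(Q, K, V, m=256, seed=0):
    """Approximate softmax attention in O(L * m * d) instead of O(L^2 * d)."""
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((m, d))            # random projections shared by Q and K
    scale = d ** -0.25                         # folds the usual 1/sqrt(d) scaling into Q and K
    q_prime = positive_random_features(Q * scale, w)          # (L, m)
    k_prime = positive_random_features(K * scale, w)          # (L, m)
    kv = k_prime.T @ V                         # (m, d); the L x L matrix is never built
    normalizer = q_prime @ k_prime.sum(axis=0)                # (L,) row sums of the implicit attention matrix
    return (q_prime @ kv) / normalizer[:, None]               # (L, d) approximate attention output

if __name__ == "__main__":
    # Tiny sanity check against exact softmax attention.
    L, d = 128, 16
    rng = np.random.default_rng(1)
    Q, K, V = (rng.standard_normal((L, d)) * 0.5 for _ in range(3))
    logits = Q @ K.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    exact = (weights / weights.sum(axis=-1, keepdims=True)) @ V
    approx = random_feature_attention(Q, K, V, m=1024)
    print("mean abs error:", np.abs(exact - approx).mean())
```

The full FAVOR+ mechanism described in the Performer paper additionally uses orthogonal random projections (and discusses periodically redrawing them during training) to reduce estimator variance; this sketch omits those refinements for brevity.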