Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.), DOI: 10.5555/3295222.3295289 - Introduces the Transformer architecture, including scaled dot-product attention and the necessity of positional encodings, which arises because the attention mechanism is otherwise permutation invariant.
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - Offers an accessible yet comprehensive explanation of Transformer models, detailing the architecture's components, including the need for and implementation of positional information.
Deep Sets, Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, Alexander J Smola, 2017, Advances in Neural Information Processing Systems (NeurIPS), Vol. 30 (Curran Associates, Inc.), DOI: 10.5555/3295222.3295328 - A foundational paper that mathematically characterizes permutation-invariant neural networks, offering theoretical context for understanding why basic self-attention treats its input as a set (illustrated in the sketch below).
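The shared point across these references can be checked directly: scaled dot-product attention applied without positional encodings is order-agnostic, so permuting the input tokens simply permutes the outputs the same way. The following is a minimal NumPy sketch (single head, no masking; all names and dimensions are illustrative, not from any of the papers' codebases) that verifies this numerically.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product attention in the style of
    # Vaswani et al. (2017), with no positional information added.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (n, n) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                           # 5 tokens, width 8 (arbitrary)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                             # reorder the tokens
out = attention(X, Wq, Wk, Wv)
out_shuffled = attention(X[perm], Wq, Wk, Wv)

# Permuting the input rows permutes the output rows identically:
# attention sees a set, not a sequence, hence positional encodings.
assert np.allclose(out[perm], out_shuffled)
print("permutation property verified")
```

Adding positional encodings to X before the projections breaks this equivalence, which is precisely why the Transformer injects order information at the input.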