Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.), DOI: 10.55988/neurips-2017-1011 - This paper introduces the Transformer architecture and the scaled dot-product attention mechanism, providing its original formulation and discussing the benefits of parallelization.
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, et al., 2023 (Cambridge University Press) - An online textbook that pairs explanations of deep learning models with runnable implementations, including a section on attention mechanisms and the matrix operations behind them.
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - A practical guide that expands on the theoretical aspects of Transformers, showing how the matrix operations are implemented and utilized in modern NLP applications.
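The scaled dot-product attention formulation discussed in the references above, softmax(QKᵀ/√d_k)·V, can be sketched in a few lines of NumPy. This is an illustrative sketch only, not code from any of the cited works; the array shapes and function name are assumptions chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D query/key/value matrices."""
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled to keep softmax well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a convex combination of the value rows.
    return weights @ V

# Example shapes (hypothetical): 3 queries and 5 keys/values, d_k = 4, d_v = 2.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 2))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 2): one d_v-dimensional output per query
```

Because every step is a dense matrix product or an elementwise operation, the whole computation parallelizes across queries, which is the efficiency benefit the original paper emphasizes.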