Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, DOI: 10.48550/arXiv.1706.03762 - This paper introduces the Transformer architecture and the scaled dot-product attention mechanism, which is the basis for the content discussed.
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, 2024, Cambridge University Press - An open-source interactive deep learning book providing a detailed explanation of attention scoring functions and the mathematical formulation of dot-product attention.
Speech and Language Processing (3rd edition draft), Daniel Jurafsky, James H. Martin, 2025 - An authoritative textbook on natural language processing, offering a rigorous mathematical treatment of attention mechanisms and Transformer models in its dedicated chapters.