Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems, Vol. 30), DOI: 10.48550/arXiv.1706.03762 - The original paper introducing the Transformer architecture and the Scaled Dot-Product Attention mechanism (sketched in code after this list).
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 (Pearson) - A comprehensive textbook with a dedicated chapter on attention and Transformers, explaining why the 1/√d_k scaling factor matters.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A foundational text providing background on activation functions such as softmax and on the vanishing gradient problem in deep learning.
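Since the entries above turn on the scaled dot-product attention formula from Vaswani et al. (2017), Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, here is a minimal NumPy sketch of that mechanism. The function name, shapes, and toy inputs are illustrative assumptions, not code from any of the works cited above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention per Vaswani et al. (2017):
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    """
    d_k = Q.shape[-1]
    # Scale the dot products by 1/sqrt(d_k) so the logits keep
    # unit-order variance; without this, large dot products push the
    # softmax into saturated regions where gradients nearly vanish
    # (the motivation the references above discuss).
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one weighted combination of V per query
```

Subtracting the row maximum before exponentiating is a standard numerical-stability trick; it leaves the softmax output unchanged.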