Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.), DOI: 10.48550/arXiv.1706.03762 - Original paper introducing the Transformer architecture and the scaled dot-product attention mechanism, including the role of softmax.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Foundational textbook providing a detailed mathematical explanation of the softmax function and its properties.
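The two references above center on the softmax function and its use in scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch of both, using the standard max-subtraction trick for numerical stability (the function and variable names here are illustrative, not from either source):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row-wise max leaves the result unchanged
    # (softmax is invariant to adding a constant) but avoids overflow.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    # where d_k is the key/query dimension (Vaswani et al., 2017).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V
```

Each row of the softmax output is a probability distribution (non-negative, summing to 1), which is what lets the attention weights be read as a weighted average over the value vectors.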