Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, 2017 (arXiv), DOI: 10.48550/arXiv.1706.03762 - The foundational paper introducing the Transformer architecture and detailing its components, including the masked self-attention mechanism in the decoder.
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, 2024 (Cambridge University Press) - A comprehensive open-source textbook providing detailed explanations and practical implementations of deep learning models, with a dedicated chapter on the Transformer architecture and its masked self-attention.