Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems, Vol. 30, NeurIPS), DOI: 10.5555/3295222.3295349 - This foundational paper introduced the Transformer architecture, detailing the encoder and decoder stacks, multi-head attention, residual connections, and layer normalization.
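The scaled dot-product attention at the heart of the paper can be sketched as follows (a minimal single-head NumPy sketch, without masking, batching, or the paper's multi-head projections; shapes and the toy data are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: rows sum to 1
    return weights @ V                               # weighted sum of value vectors

# Toy example: 3 query positions, 4 key/value positions, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one d_k-dimensional output per query: (3, 8)
```

Multi-head attention, as described in the paper, runs several such attention functions in parallel on learned linear projections of Q, K, and V, then concatenates and projects the results.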
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - A practical guide covering the Transformer architecture in depth, including its components, training, and fine-tuning for various NLP tasks. It provides a contemporary perspective on applying Transformers.
Stanford CS224N: Natural Language Processing with Deep Learning, Winter 2023, Christopher Manning, Abigail See, and Kevin Clark, 2023 (Stanford University) - This graduate-level course offers detailed explanations of deep learning models for NLP, with dedicated lectures on the Transformer architecture, attention mechanisms, and their practical implementation.