Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems 30, Curran Associates, Inc.) - This foundational paper introduced the Transformer architecture and the self-attention mechanism, leading to significant advances in sequence modeling.
torch.nn API Reference, PyTorch Core Team, 2024 (PyTorch) - Official documentation for PyTorch's neural network module, providing definitions and usage examples for layers like nn.Linear, nn.Embedding, and nn.LayerNorm.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering the theoretical foundations and practical aspects of deep learning, including architectures relevant to Transformers.
CS224N: Natural Language Processing with Deep Learning, 2024 (Stanford University) - An advanced university course offering detailed lectures and resources on deep learning for natural language processing, including in-depth discussions of Transformer models.