Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.), DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and its associated learning rate schedule, including warmup, which became standard for large language models.
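For reference, the paper's schedule sets the learning rate to d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)): a linear ramp during warmup followed by inverse-square-root decay. A minimal Python sketch, using the paper's base values (d_model = 512, warmup_steps = 4000); the function name and the step-0 guard are illustrative:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from the Transformer paper:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # guard against step 0 (assumed convention)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```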
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering fundamental deep learning concepts, including detailed explanations of gradient clipping as an optimization stabilization technique.
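As an illustration of the norm-clipping rule the book describes (rescale the gradient g to g · threshold / ||g|| whenever its global norm exceeds the threshold), here is a hypothetical from-scratch sketch; the helper name and the small epsilon stabilizer are assumptions for the example, not code from the book:

```python
import torch

def clip_grad_by_norm(parameters, max_norm):
    # Rescale gradients in place when their global L2 norm exceeds max_norm:
    # g <- g * max_norm / ||g||
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)  # epsilon for numerical safety (assumed)
        for g in grads:
            g.mul_(scale)
    return total_norm  # pre-clipping norm, handy for logging
```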
torch.nn.utils.clip_grad_norm_, PyTorch Contributors, 2024 - Official PyTorch documentation for implementing gradient norm clipping, including usage examples and parameter descriptions.
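A minimal usage sketch of this API in a toy training step; the model, dummy loss, and max_norm = 1.0 are placeholder assumptions for illustration, not values taken from the docs:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)  # placeholder model (assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss for the sketch
loss.backward()

# Clip the global gradient norm to 1.0 before the optimizer step;
# the function returns the total norm computed before clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```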
How to adjust learning rate, PyTorch Contributors, 2025 - Official PyTorch documentation explaining learning rate schedulers, including the LinearLR, CosineAnnealingLR, and SequentialLR schedulers demonstrated in the course content.
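A short sketch of how these three schedulers can be chained into a warmup-then-decay schedule, assuming a 100-step linear warmup followed by cosine decay; the step counts, base learning rate, and model are illustrative placeholders, not values from the documentation:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(10, 1)  # placeholder model (assumed)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 100, 1000  # assumed step counts for the sketch

# Linear warmup from 1% of the base LR up to the full LR, then cosine decay.
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_steps])

for step in range(total_steps):
    # ... forward pass, loss.backward() would go here ...
    optimizer.step()   # step the optimizer before the scheduler
    scheduler.step()
```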