Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.1706.03762 - This paper introduced the Transformer architecture along with its learning rate schedule, which combines a linear warmup phase with inverse-square-root decay. It was highly influential in popularizing warmup for large deep learning models.
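For reference, the schedule from this paper increases the learning rate linearly for the first warmup_steps training steps, then decays it proportionally to the inverse square root of the step number (the paper's experiments use warmup_steps = 4000):

$$
\mathrm{lrate} = d_{\mathrm{model}}^{-0.5} \cdot \min\left(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup\_steps}^{-1.5}\right)
$$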
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. 2017. arXiv preprint arXiv:1706.02677. DOI: 10.48550/arXiv.1706.02677 - This work demonstrates the effectiveness of gradual learning rate warmup when training with very large mini-batches: it mitigates instability early in training and enables fast convergence, a central theme of the section.
Learning rate scheduling - PyTorch Documentation. PyTorch Authors. 2025. PyTorch Foundation. - The official documentation for PyTorch's learning rate schedulers, offering detailed guidance and examples for implementing various schedules, including the LambdaLR-based warmup demonstrated in the section.
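As a minimal sketch of the warmup pattern those docs describe (LambdaLR and the optimizer are real PyTorch APIs; the model, warmup length, and learning rate here are illustrative assumptions):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical model and optimizer, purely for illustration.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps = 500  # assumed warmup length

def warmup_lambda(step):
    # Linear warmup: scale the base LR from near 0 up to 1.0
    # over warmup_steps, then hold it constant.
    if step < warmup_steps:
        return float(step + 1) / float(warmup_steps)
    return 1.0

scheduler = LambdaLR(optimizer, lr_lambda=warmup_lambda)

for step in range(1000):
    optimizer.step()   # in real training, called after loss.backward()
    scheduler.step()   # advance the schedule once per optimizer step
```

LambdaLR multiplies the optimizer's base learning rate by the lambda's return value each step, so the same pattern composes with a decay phase by extending the lambda's post-warmup branch.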