Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016, MIT Press - Provides a comprehensive foundation for deep learning optimization techniques, including the role of learning rates and basic scheduling approaches.
Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, 2015, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1412.6980 - Presents the Adam optimizer, a widely used adaptive method that is often combined with global learning rate schedules.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017, Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and demonstrates the effectiveness of a learning rate schedule that includes a warmup phase (sketched below).
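
To make the last two entries concrete, here is a minimal sketch of the warmup schedule from Vaswani et al. driving the Adam optimizer from Kingma and Ba, assuming PyTorch. The formula, `d_model=512`, `warmup_steps=4000`, and the Adam hyperparameters `betas=(0.9, 0.98)`, `eps=1e-9` are the values given in the Transformer paper; the model, loss, and function name `transformer_lr` are illustrative placeholders.

```python
import torch

# Warmup schedule from "Attention Is All You Need":
#   lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
# The rate rises linearly for `warmup_steps` steps, then decays with
# the inverse square root of the step number.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(512, 512)  # stand-in for a real model
# Adam (Kingma & Ba, 2015); lr=1.0 so the lambda sets the actual rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=transformer_lr)

for step in range(10):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # advances the global learning rate each step
```

Setting the optimizer's base `lr` to 1.0 lets the schedule function return the absolute learning rate directly, which mirrors how the paper states the formula.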