Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Chapters 4 and 8 provide a comprehensive explanation of gradient descent optimization, including the role of the learning rate and its practical challenges in deep learning.
Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, 2015 (International Conference on Learning Representations), DOI: 10.48550/arXiv.1412.6980 - This paper introduces Adam, a widely adopted adaptive learning rate algorithm that addresses many limitations of fixed learning rates by dynamically adjusting step sizes for individual parameters.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, John Duchi, Elad Hazan, and Yoram Singer, 2011 (Journal of Machine Learning Research, Vol. 12) - Presents AdaGrad, an early and influential adaptive learning rate algorithm that scales learning rates based on the historical sum of squared gradients for each parameter.
Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude, Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, 2012 (Coursera Lecture Notes, University of Toronto) - The original lecture slides introducing RMSprop, an adaptive learning rate method developed to mitigate AdaGrad's aggressively diminishing learning rates; a brief sketch of the update rules behind these adaptive methods follows this list.
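As a quick orientation to the three adaptive methods cited above, the following is a minimal NumPy sketch of their per-parameter update rules as described in the referenced papers and lecture notes. The function names, the mutable `state` dictionary, and the default hyperparameter values shown are illustrative choices for this sketch, not an API from any particular library.

```python
import numpy as np

def adagrad_step(theta, grad, state, lr=0.01, eps=1e-8):
    # Accumulate the historical sum of squared gradients (Duchi et al., 2011)
    # and scale the learning rate per parameter by that accumulated magnitude.
    state["sum_sq"] = state.get("sum_sq", np.zeros_like(theta)) + grad**2
    return theta - lr * grad / (np.sqrt(state["sum_sq"]) + eps)

def rmsprop_step(theta, grad, state, lr=0.001, decay=0.9, eps=1e-8):
    # Replace AdaGrad's ever-growing sum with an exponentially decayed
    # running average of squared gradients (Hinton et al., 2012).
    state["avg_sq"] = decay * state.get("avg_sq", np.zeros_like(theta)) \
        + (1 - decay) * grad**2
    return theta - lr * grad / (np.sqrt(state["avg_sq"]) + eps)

def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Maintain moving averages of the gradient (first moment) and the squared
    # gradient (second moment), with bias correction, per Kingma and Ba (2015).
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", np.zeros_like(theta)) + (1 - beta1) * grad
    v = beta2 * state.get("v", np.zeros_like(theta)) + (1 - beta2) * grad**2
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1**t)  # bias-corrected first moment
    v_hat = v / (1 - beta2**t)  # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)
```

Each function takes the current parameters, the gradient of the loss at those parameters, and a mutable state dictionary; calling one of them repeatedly inside a training loop applies the corresponding optimizer's effective, per-parameter learning rate rather than a single fixed step size.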