Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, John Duchi, Elad Hazan, Yoram Singer, 2011 (Journal of Machine Learning Research, Vol. 12) - This foundational paper introduces the AdaGrad algorithm, detailing its adaptive learning rate mechanism based on the accumulation of past squared gradients; a minimal sketch of this update appears after this list.
Lecture 6e: RMSProp, Tijmen Tieleman, Geoffrey Hinton, 2012 (Coursera course 'Neural Networks for Machine Learning') - This lecture introduces RMSprop, an optimizer designed to address AdaGrad's rapid learning rate decay by replacing the lifetime sum of squared gradients with a moving average; see the second sketch after this list.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - This authoritative textbook provides a comprehensive discussion of optimization algorithms in deep learning, including an analysis of AdaGrad's properties and its limitations.
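For reference, here is a minimal sketch of the AdaGrad update described in Duchi et al. (2011), assuming NumPy arrays; the function name `adagrad_update` and the hyperparameter defaults are illustrative, not taken from the paper:

```python
import numpy as np

def adagrad_update(w, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step (illustrative sketch): the squared-gradient
    accumulator grows monotonically, so each parameter's effective
    step size shrinks over the course of training."""
    accum = accum + grad ** 2                   # lifetime sum of squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)  # per-parameter scaled step
    return w, accum
```

The monotone growth of `accum` is exactly the property the annotations above refer to: the denominator only increases, so the effective learning rate eventually decays toward zero.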
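And a similar sketch of the RMSprop modification from Tieleman and Hinton's lecture: an exponential moving average of squared gradients replaces AdaGrad's unbounded sum. The 0.9 decay follows the value used in the lecture slides; the name `rmsprop_update` and the other defaults are illustrative:

```python
def rmsprop_update(w, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop step (illustrative sketch): because the average is
    leaky, the denominator can shrink again when recent gradients are
    small, so the learning rate does not decay to zero."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2  # leaky average of squared gradients
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)        # same scaled step as AdaGrad
    return w, avg_sq
```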