While AdaGrad introduces the valuable concept of per-parameter adaptive learning rates, its specific mechanism for accumulating gradient history leads to a significant practical issue. Recall the AdaGrad update rule, where the effective learning rate for a parameter $\theta_i$ at timestep $t$ is scaled by the inverse square root of the sum of squared historical gradients for that parameter:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

Here, $g_{t,i}$ is the gradient of the loss with respect to parameter $\theta_i$ at timestep $t$, $\eta$ is the global learning rate, $\epsilon$ is a small constant for numerical stability, and $G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ from timestep 1 to $t$:
$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

The core limitation stems from the accumulation term $G_{t,ii}$. Because $g_{\tau,i}^2$ is always non-negative (being a squared value), the sum $G_{t,ii}$ increases monotonically throughout training whenever the gradients are non-zero. It never decreases.
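To make the accumulation concrete, here is a minimal NumPy sketch of a single AdaGrad step. The function name `adagrad_step` and the default values for `eta` and `eps` are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-8):
    """One AdaGrad update for a vector of parameters (illustrative sketch).

    G is the per-parameter running sum of squared gradients; since grad**2
    is non-negative, G can only grow from step to step.
    """
    G = G + grad ** 2                              # G_{t,ii} = sum of g_{tau,i}^2
    theta = theta - eta * grad / np.sqrt(G + eps)  # per-parameter scaled update
    return theta, G

# Usage: the accumulator G starts at zero and is carried across steps.
theta = np.array([0.5, -0.3])
G = np.zeros_like(theta)
grad = np.array([0.8, 0.05])                       # hypothetical gradient values
theta, G = adagrad_step(theta, grad, G)
```

Because `G` is threaded through every call and only ever increases, the divisor `np.sqrt(G + eps)` grows without bound over training.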
As training progresses, $G_{t,ii}$ continuously grows larger. Consequently, the denominator term $\sqrt{G_{t,ii} + \epsilon}$ also grows. This causes the effective learning rate, $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$, to shrink monotonically towards zero.
Example illustration of how the effective learning rate for two different parameters might decrease over training iterations using AdaGrad. Parameters experiencing larger gradients (Parameter 1) see a much faster decay compared to those with smaller gradients (Parameter 2), but both trend towards zero. Initial global learning rate $\eta = 0.01$.
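The trend the figure describes can be reproduced with a short simulation. The constant gradient magnitudes below (1.0 and 0.1) are assumed purely for illustration; they are not values from the original plot.

```python
import numpy as np

eta, eps = 0.01, 1e-8
grads = {"Parameter 1 (large gradients)": 1.0,   # assumed constant gradient magnitudes
         "Parameter 2 (small gradients)": 0.1}

for name, g in grads.items():
    G = 0.0
    for t in range(1, 1001):
        G += g ** 2                              # monotone accumulation of squared gradients
        eff_lr = eta / np.sqrt(G + eps)          # effective learning rate at step t
    print(f"{name}: effective learning rate after 1000 steps = {eff_lr:.2e}")
```

With a constant gradient magnitude $g$, the accumulator grows as $G_{t,ii} = t\,g^2$, so the effective learning rate falls roughly like $\eta / (g\sqrt{t})$: the parameter with larger gradients decays fastest, but both shrink towards zero and never recover.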
This overly aggressive decay can be problematic. In deep learning, where optimization landscapes are complex and non-convex, training often requires exploration even late in the process. If the learning rate becomes vanishingly small too early, the optimizer might effectively stop making progress long before reaching a satisfactory minimum. The model's ability to learn can be prematurely halted.
While AdaGrad was an important step towards adaptive learning rates, this limitation prompted the development of algorithms that could adapt learning rates without causing them to decay quite so aggressively. Methods like RMSprop, which we will explore next, modify the accumulation mechanism to prevent this unbounded growth.