As we discussed, relying on a single, fixed learning rate η for all parameters across the entire training process can be inefficient. Parameters associated with frequently occurring features might benefit from smaller updates to avoid divergence, while parameters tied to sparse features might require larger updates to make meaningful progress. Manually tuning separate rates for potentially millions of parameters is infeasible.
AdaGrad (Adaptive Gradient Algorithm) is one of the first algorithms designed to address this challenge by introducing per-parameter learning rates that adapt automatically based on the history of gradients observed for each parameter.
The intuition behind AdaGrad is straightforward: parameters that have experienced large gradients frequently in the past should have their learning rates reduced, while parameters that have seen small or infrequent gradients should have their learning rates increased (or, more accurately, reduced less). This promotes faster progress on flatter directions of the loss surface (often associated with sparse features) and more careful steps along steeper directions (often associated with dense features).
AdaGrad modifies the standard gradient descent update. For each parameter $w_i$ at timestep $t$, the update is:

$$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

Let's break down the components:

- $w_{t,i}$: the value of parameter $i$ at timestep $t$.
- $\eta$: the global learning rate shared by all parameters.
- $g_{t,i}$: the gradient of the loss with respect to $w_i$ at timestep $t$.
- $G_{t,ii}$: the accumulated sum of squared gradients for parameter $i$ up to timestep $t$, $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ (the $i$-th diagonal entry of the matrix $G_t$).
- $\epsilon$: a small constant (e.g., $10^{-8}$) that prevents division by zero.
Notice how the update rule incorporates $G_{t,ii}$. The term $\sqrt{G_{t,ii} + \epsilon}$ acts as a per-parameter scaling factor. If a parameter $w_i$ has consistently had large gradients ($g_{\tau,i}$), its corresponding $G_{t,ii}$ will grow large, making the denominator large and thus reducing the effective learning rate $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ for that specific parameter. Conversely, if a parameter has seen small or zero gradients (common for sparse features), $G_{t,ii}$ will remain small, resulting in a larger effective learning rate for that parameter.
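As a concrete illustration, here is a minimal NumPy sketch of this per-parameter update; the function name, toy gradients, and hyperparameter values are illustrative, not a reference implementation:

```python
import numpy as np

def adagrad_update(w, grad, G, lr=0.1, eps=1e-8):
    """One AdaGrad step for a parameter vector w (illustrative sketch)."""
    G = G + grad ** 2                     # accumulate squared gradients per parameter
    w = w - lr * grad / np.sqrt(G + eps)  # larger history -> smaller effective step
    return w, G

# Toy example: parameter 0 sees a dense (every-step) gradient,
# parameter 1 a sparse one (only every 10th step).
w = np.zeros(2)
G = np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    w, G = adagrad_update(w, grad, G)

print(G)  # [100.  10.] -- the dense parameter has accumulated a much larger G
print(w)  # the sparse parameter retained a larger effective learning rate per update
```

Because the accumulator $G$ is maintained element-wise, each parameter effectively gets its own learning rate without any manual tuning.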
AdaGrad's primary drawback stems directly from its core mechanism: the accumulation of squared gradients in the denominator ($G_{t,ii}$). Since squared gradients are always non-negative, this sum grows monotonically throughout training. As $G_{t,ii}$ increases, the effective learning rate $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ continuously shrinks for every parameter.
In many scenarios, especially in deep learning where training can span many epochs, this decay becomes too aggressive. The learning rate can diminish to near-zero values relatively quickly, causing the training process to slow down drastically or even stop prematurely, long before reaching a good minimum.
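To see why this matters, the short snippet below (again an illustrative sketch, assuming a constant unit gradient) shows the effective learning rate $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ collapsing as the accumulator grows:

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0
for t in range(1, 10_001):
    g = 1.0                  # constant unit gradient, purely for illustration
    G += g ** 2              # the accumulator only ever grows
    if t in (1, 10, 100, 1_000, 10_000):
        print(f"step {t:>5}: effective lr = {eta / np.sqrt(G + eps):.5f}")

# Prints roughly 0.10000, 0.03162, 0.01000, 0.00316, 0.00100:
# the effective learning rate decays like 1/sqrt(t) even though nothing
# about the problem has changed.
```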
This limitation motivated the development of subsequent adaptive methods like RMSprop and Adam, which aim to retain the benefits of per-parameter adaptation while mitigating the issue of aggressively decaying learning rates. We will explore these methods next.