As we saw with SGD and its variants like Momentum, a single learning rate governs the update step for all parameters in our network. While techniques like learning rate scheduling can adjust this global rate over time, they don't account for the fact that different parameters might require different update magnitudes. Consider a scenario with sparse features: some parameters associated with rare features might only receive informative gradients occasionally, while parameters for frequent features receive updates constantly. A uniform learning rate might be too small for the infrequent parameters to learn effectively or too large for the frequent ones, causing instability.
The AdaGrad (Adaptive Gradient Algorithm) optimizer addresses this by maintaining a per-parameter learning rate that adapts based on the history of gradients observed for that parameter. The intuition is straightforward: parameters that have experienced large gradients frequently should have their learning rate reduced to prevent overshooting, while parameters associated with infrequent updates (and potentially smaller gradients historically) should maintain a larger learning rate to facilitate learning.
AdaGrad achieves this adaptation by accumulating the square of the gradients for each parameter over all previous time steps. Let $g_{t,i}$ be the gradient of the objective function with respect to parameter $\theta_i$ at time step $t$. AdaGrad maintains a state variable, let's call it $G_{t,i}$, for each parameter $\theta_i$, which is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$.
The update rule for $G_{t,i}$ is:
$$G_{t,i} = G_{t-1,i} + g_{t,i}^2$$

Here, $G_{0,i}$ is typically initialized to zero. This value, $G_{t,i}$, represents the historical gradient information accumulated for parameter $\theta_i$.
When updating the actual parameter $\theta_i$, AdaGrad modifies the standard gradient descent update by dividing the global learning rate $\eta$ by the square root of this accumulated sum $G_{t,i}$. A small constant $\epsilon$ (epsilon, e.g., $10^{-8}$) is added to the denominator for numerical stability, preventing division by zero, especially during the initial steps when $G_{t,i}$ might be zero.
The parameter update for $\theta_i$ at time step $t$ is:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \, g_{t,i}$$

Let's break this down:
Notice how the effective learning rate for each parameter $\theta_i$, which is $\frac{\eta}{\sqrt{G_{t,i} + \epsilon}}$, changes dynamically. If a parameter consistently receives large gradients, $G_{t,i}$ will grow quickly, causing the effective learning rate for that parameter to decrease. Conversely, if a parameter receives small or infrequent gradients, $G_{t,i}$ will grow slowly, and the effective learning rate will remain relatively high.
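For a concrete sense of scale (using illustrative numbers, with $\epsilon$ ignored for simplicity), take $\eta = 0.1$. A parameter whose accumulated sum has grown to $G_{t,i} = 100$ is updated with an effective step size of $0.01$, while a parameter with $G_{t,i} = 1$ still moves with a step size of $0.1$:

$$\frac{0.1}{\sqrt{100}} = 0.01 \qquad \text{vs.} \qquad \frac{0.1}{\sqrt{1}} = 0.1$$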
In practice, these operations are performed element-wise on vectors and matrices. If $\theta_t$ represents the vector of all parameters at time step $t$, $g_t$ is the gradient vector, and $G_t$ is the vector accumulating the element-wise squares of gradients, the update rules are:
$$G_t = G_{t-1} + g_t \odot g_t$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$

where $\odot$ denotes element-wise multiplication, and the square root and addition of $\epsilon$ are also applied element-wise.
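These two lines translate almost directly into code. Below is a minimal NumPy sketch of a single AdaGrad step; the function name `adagrad_update` and the toy quadratic objective are illustrative choices, not part of any particular library.

```python
import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """Apply one AdaGrad step in place and return the updated state.

    theta: parameter vector, grad: gradient at theta,
    G: running sum of element-wise squared gradients (starts at zeros).
    """
    G += grad * grad                       # G_t = G_{t-1} + g_t ⊙ g_t
    theta -= lr / np.sqrt(G + eps) * grad  # θ_{t+1} = θ_t − η / √(G_t + ε) ⊙ g_t
    return theta, G

# Toy usage: minimize f(θ) = 0.5 * ||θ||², whose gradient is θ itself.
theta = np.array([1.0, -2.0, 0.5])
G = np.zeros_like(theta)
for _ in range(100):
    grad = theta.copy()                    # gradient of the toy objective
    theta, G = adagrad_update(theta, grad, G, lr=0.1)
print(theta)                               # entries have moved toward zero
```

Note that the accumulator `G` is state that must be carried across steps, one entry per parameter, just like the momentum buffer in SGD with Momentum.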
While innovative, AdaGrad has a significant property to keep in mind: the accumulated sum of squared gradients $G_t$ only grows over time. Because $G_t$ appears in the denominator, the effective learning rate for every parameter will monotonically decrease throughout training. In some cases, this decay can become too aggressive, causing the learning rate to shrink to near zero, effectively halting learning prematurely. This limitation motivated the development of subsequent adaptive algorithms like RMSprop and Adam, which we will explore next.
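Before moving on, a quick numerical sketch (illustrative values only) makes the decay concrete: it tracks the effective learning rate $\eta / \sqrt{G_t + \epsilon}$ for a single parameter that receives a gradient of $1.0$ at every step.

```python
import numpy as np

lr, eps = 0.1, 1e-8
G = 0.0
for t in range(1, 1001):
    g = 1.0                             # constant gradient at every step
    G += g * g                          # accumulator grows without bound
    if t in (1, 10, 100, 1000):
        print(t, lr / np.sqrt(G + eps))
# prints roughly: 0.1, 0.0316, 0.01, 0.0032 -- the step size keeps shrinking
```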