Relying on a single, fixed learning rate $\eta$ for all parameters across the entire training process can be inefficient. Parameters associated with frequently occurring features might benefit from smaller updates to avoid divergence, while parameters tied to sparse features might require larger updates to make meaningful progress. Manually tuning separate rates for potentially millions of parameters is infeasible. AdaGrad (Adaptive Gradient Algorithm) is one of the first algorithms designed to address this challenge by introducing per-parameter learning rates that adapt automatically based on the history of gradients observed for each parameter.

### The Core Idea: Scaling Learning Rates Inversely to Past Gradients

The intuition behind AdaGrad is straightforward: parameters that have frequently experienced large gradients in the past should have their learning rates reduced, while parameters that have seen small or infrequent gradients should have their learning rates increased (or, more accurately, reduced less). This promotes faster progress along flatter directions of the loss surface (often associated with sparse features) and more careful steps along steeper directions (often associated with dense features).

### The AdaGrad Update Rule

AdaGrad modifies the standard gradient descent update. For each parameter $w_i$ at timestep $t$, the update is:

$$ w_{t+1, i} = w_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} g_{t, i} $$

Let's break down the components:

- $w_{t, i}$: The value of the $i$-th parameter at timestep $t$.
- $\eta$: A global learning rate, similar to standard gradient descent. Although AdaGrad adapts rates per parameter, this initial global rate still sets the overall scale.
- $g_{t, i}$: The gradient of the loss function with respect to the parameter $w_i$ at timestep $t$, i.e., $\frac{\partial L}{\partial w_{t, i}}$.
- $G_{t, ii}$: The key ingredient. This is a running sum of the squares of the gradients for parameter $w_i$ from the first timestep up to the current timestep $t$:
  $$ G_{t, ii} = \sum_{\tau=1}^{t} g_{\tau, i}^2 $$
  In practice, this is computed iteratively. If $G_{t-1}$ stores the sums up to the previous step, then $G_t$ can be computed by adding the element-wise square of the current gradient vector $g_t$: $G_t = G_{t-1} + g_t \odot g_t$, where $\odot$ denotes element-wise multiplication. $G_{t, ii}$ is the $i$-th diagonal element of this accumulated matrix (or simply the $i$-th element if $G_t$ is stored as a vector).
- $\epsilon$: A small smoothing term (e.g., $10^{-7}$ or $10^{-8}$) added to the denominator to prevent division by zero, especially during the initial steps when $G_{t, ii}$ might be zero.

Notice how the update rule incorporates $G_{t, ii}$. The term $\sqrt{G_{t, ii} + \epsilon}$ acts as a per-parameter scaling factor. If a parameter $w_i$ has consistently had large gradients ($g_{\tau, i}$), its corresponding $G_{t, ii}$ will grow large, making the denominator large and thus reducing the effective learning rate $\frac{\eta}{\sqrt{G_{t, ii} + \epsilon}}$ for that specific parameter. Conversely, if a parameter has seen small or zero gradients (common for sparse features), $G_{t, ii}$ will remain small, resulting in a larger effective learning rate for that parameter.
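To make the mechanics concrete, here is a minimal NumPy sketch of the update (the function name, variable names, toy quadratic loss, and hyperparameter values are illustrative assumptions, not a reference implementation). The accumulator $G$ is stored as a vector with the same shape as the parameters.

```python
import numpy as np

def adagrad_step(w, g, G, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update to parameter vector w given gradient g.

    G is the running sum of element-wise squared gradients:
    G_t = G_{t-1} + g_t * g_t.
    """
    G = G + g * g                          # accumulate squared gradients
    w = w - (lr / np.sqrt(G + eps)) * g    # per-parameter scaled step
    return w, G

# Toy usage: minimize 0.5 * ||w - target||^2, whose gradient is (w - target).
target = np.array([1.0, -2.0, 3.0])
w = np.zeros_like(target)
G = np.zeros_like(target)
for _ in range(500):
    g = w - target
    w, G = adagrad_step(w, g, G, lr=0.5)
print(w)  # close to [1.0, -2.0, 3.0]
```

Note that each parameter effectively receives its own learning rate, `lr / np.sqrt(G[i] + eps)`, which is exactly the per-parameter scaling factor in the update rule above.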
### Strengths of AdaGrad

- Parameter-Specific Adaptation: Automatically tunes learning rates for individual parameters, often eliminating the need for extensive manual tuning compared to basic SGD.
- Effective for Sparse Data: Particularly well-suited for problems with sparse features (like natural language processing tasks using word embeddings, or recommendation systems). Infrequent features receive larger updates, allowing the model to learn effectively from them despite their rarity.

### Weakness: The Aggressively Decaying Learning Rate

AdaGrad's primary drawback stems directly from its core mechanism: the accumulation of squared gradients in the denominator ($G_{t, ii}$). Since squared gradients are always non-negative, this sum grows monotonically throughout training. As $G_{t, ii}$ increases, the effective learning rate $\frac{\eta}{\sqrt{G_{t, ii} + \epsilon}}$ continuously shrinks for every parameter.

In many scenarios, especially in deep learning where training can span many epochs, this decay becomes too aggressive. The learning rate can diminish to near-zero values relatively quickly, causing the training process to slow down drastically or even stop prematurely, long before reaching a good minimum.

This limitation motivated the development of subsequent adaptive methods like RMSprop and Adam, which aim to retain the benefits of per-parameter adaptation while mitigating the issue of aggressively decaying learning rates. We will explore these methods next.
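As a closing illustration of how quickly this decay can bite, consider a purely hypothetical setting in which a single parameter receives a gradient of constant magnitude 1.0 at every step. Then the accumulated sum $G_{t, ii}$ equals $t$, and the effective learning rate falls off like $\eta / \sqrt{t}$:

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0
for t in range(1, 100_001):
    g = 1.0                  # hypothetical constant-magnitude gradient
    G += g ** 2              # accumulator grows without bound
    if t in (1, 100, 10_000, 100_000):
        print(t, eta / np.sqrt(G + eps))
# t = 1       -> ~0.1
# t = 100     -> ~0.01
# t = 10,000  -> ~0.001
# t = 100,000 -> ~0.0003
```

After 10,000 such steps the effective rate is already a hundredth of the original $\eta$, even though nothing about the loss surface suggests the step size should shrink that much.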