In the previous section, we saw how AdaGrad adapts the learning rate for each parameter by accumulating the squares of past gradients. While this helps navigate varying curvatures in the loss landscape, it introduces a potential issue: the accumulated sum in the denominator, $\sum_{\tau=1}^{t} g_\tau^2$, grows monotonically. Over many iterations, especially in deep learning scenarios with non-convex objectives, this denominator can become very large, causing the effective learning rate to shrink towards zero prematurely. This can effectively halt the learning process long before convergence is reached.
RMSprop (Root Mean Square Propagation), an unpublished adaptive learning rate method proposed by Geoffrey Hinton in his Coursera lectures on neural networks, directly addresses this issue. Instead of letting the sum of squared gradients accumulate indefinitely, RMSprop uses an exponentially decaying average of squared gradients. Older gradients therefore contribute progressively less to the average, so the denominator cannot grow without bound and drive the effective learning rate to zero.
Let's look at the mechanism. At each time step $t$, RMSprop first computes the gradient $g_t = \nabla_\theta J(\theta_t)$ for the current parameters $\theta_t$. It then maintains a moving average $E[g^2]_t$ of the squared gradients, updated as follows:
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$

Here:

- $E[g^2]_t$ is the exponentially decaying average of squared gradients at step $t$.
- $\gamma$ is the decay rate controlling how quickly older gradients are forgotten.
- $g_t^2$ denotes the element-wise square of the current gradient.
Compare this to AdaGrad's denominator, which applies no decay at all: every past squared gradient is kept at full weight, so the accumulator can only grow. By choosing $\gamma < 1$ and weighting the new term by $(1 - \gamma)$, RMSprop ensures that the influence of very old gradients diminishes over time.
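To make the contrast concrete, here is a small NumPy sketch (the function names, the $\gamma = 0.9$ value, and the synthetic gradient stream are purely illustrative) that feeds the same sequence of gradients to both accumulators:

```python
import numpy as np

def adagrad_accumulate(acc, grad):
    """AdaGrad: squared gradients pile up forever, so the accumulator only grows."""
    return acc + grad ** 2

def rmsprop_accumulate(avg, grad, gamma=0.9):
    """RMSprop: exponentially decaying average, so old gradients fade out."""
    return gamma * avg + (1 - gamma) * grad ** 2

# Same gradient stream for both: large gradients early on, small ones afterwards.
acc = np.zeros(1)
avg = np.zeros(1)
for step in range(200):
    grad = np.array([5.0]) if step < 50 else np.array([0.1])
    acc = adagrad_accumulate(acc, grad)
    avg = rmsprop_accumulate(avg, grad)

print(acc)  # ~1251.5: still dominated by the large early gradients
print(avg)  # ~0.01:  has essentially forgotten them
```

The AdaGrad accumulator keeps shrinking the effective learning rate even though recent gradients are tiny, while the RMSprop average tracks the recent gradient magnitude and lets the step size recover.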
The parameter update rule for RMSprop then uses this moving average:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

Where:

- $\eta$ is the global learning rate.
- $\epsilon$ is a small constant (e.g., $10^{-8}$) added for numerical stability, preventing division by zero.
The core idea is intuitive: if recent gradients for a parameter have been large, $E[g^2]_t$ will be large, and the effective learning rate for that parameter, $\eta / \sqrt{E[g^2]_t + \epsilon}$, will decrease. Conversely, if recent gradients have been small, the effective learning rate will increase. Crucially, because $E[g^2]_t$ is a moving average, it can also decrease if recent gradients become small again, allowing the learning rate to recover, unlike AdaGrad.
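Putting the two equations together, a single RMSprop step fits in a few lines. The sketch below uses NumPy; the function name, the default hyperparameter values, and the toy quadratic objective are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """Apply one RMSprop update; returns the new parameters and moving average.

    All operations are element-wise, so every parameter effectively gets its
    own learning rate lr / sqrt(avg_sq + eps).
    """
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2    # E[g^2]_t
    theta = theta - lr * grad / np.sqrt(avg_sq + eps)    # parameter update
    return theta, avg_sq

# Toy usage: minimise f(theta) = theta^2 starting from theta = 5.
theta = np.array([5.0])
avg_sq = np.zeros_like(theta)
for _ in range(2000):
    grad = 2 * theta                                     # gradient of theta^2
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq, lr=0.01)
print(theta)  # ends up near 0, oscillating within roughly one lr-sized step
```

Note that the running average avg_sq must be carried between calls; resetting it to zero every step would reduce the update to a fixed-size step in the direction of the gradient's sign.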
RMSprop therefore introduces one new hyperparameter, the decay rate $\gamma$, alongside the global learning rate $\eta$ and the stability constant $\epsilon$. Hinton suggests $\gamma = 0.9$ as a good default, with a learning rate around $\eta = 0.001$.
In essence, RMSprop modifies AdaGrad's denominator to use a leaky average instead of a cumulative sum. This simple change prevents the aggressive learning rate decay that can stall AdaGrad, leading to a more effective and commonly used adaptive optimization algorithm. It forms a stepping stone towards Adam, which further refines this idea by incorporating momentum.
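In practice you will usually reach for a library implementation rather than writing the update yourself. For example, PyTorch provides torch.optim.RMSprop (note that it names the decay rate alpha and defaults it to 0.99); the model and data below are placeholders just to show the call pattern:

```python
import torch

model = torch.nn.Linear(10, 1)                       # any model's parameters work here
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9, eps=1e-8)

x, y = torch.randn(32, 10), torch.randn(32, 1)       # a dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()                                     # applies the RMSprop update
```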