As we saw with AdaGrad, adapting the learning rate for each parameter based on its gradient history is a powerful idea. However, AdaGrad's primary drawback is its persistent accumulation of squared gradients in the denominator. Over time, this sum grows continuously, causing the effective learning rate to shrink monotonically and eventually become infinitesimally small. This aggressive decay can prematurely stop the learning process, especially in non-convex optimization landscapes common in deep learning where progress might be needed even after many iterations.
RMSprop (Root Mean Square Propagation), an unpublished method proposed by Geoffrey Hinton in his Coursera lectures on neural networks, directly addresses this rapid decay issue. The central innovation in RMSprop is the use of an exponentially decaying moving average of the squared gradients, rather than a sum of all past squared gradients. This approach gives more weight to recent gradient information and effectively "forgets" the distant past.
Instead of maintaining a sum $G_t$ of all past squared gradients for each parameter, RMSprop maintains an estimate of the mean square of the gradients, denoted $E[g^2]_t$. This estimate is updated at each time step $t$ using a decay rate hyperparameter $\beta$ (often between 0.9 and 0.99):
$$E[g^2]_t = \beta \, E[g^2]_{t-1} + (1 - \beta) \, g_t^2$$
Here:
- $E[g^2]_t$ is the exponentially decaying average of squared gradients at step $t$.
- $\beta$ is the decay rate, controlling how quickly older gradients are forgotten.
- $g_t$ is the gradient of the loss with respect to the parameter at step $t$.
This calculation ensures that $E[g^2]_t$ is dominated by more recent squared gradients: if gradients have been large recently, $E[g^2]_t$ will be large, and if they have been small, it will decrease.
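As a minimal sketch of this accumulator (the names beta, grad, and avg_sq_grad are illustrative, not from any library), the update can be written in a few lines of PyTorch:

import torch

beta = 0.9                              # decay rate for the moving average
grad = torch.tensor([0.5, -1.0, 0.1])   # hypothetical gradient g_t for three parameters
avg_sq_grad = torch.zeros_like(grad)    # E[g^2] estimate, initialized to zero

# One update of the exponentially decaying average of squared gradients
avg_sq_grad = beta * avg_sq_grad + (1 - beta) * grad ** 2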
The parameter update rule for RMSprop then uses the square root of this moving average in the denominator, similar to AdaGrad, but crucially using the adaptive $E[g^2]_t$ instead of the ever-growing sum:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t$$
Where:
- $\theta_t$ are the parameter values at step $t$.
- $\eta$ is the global learning rate.
- $\epsilon$ is a small constant added for numerical stability.
By using the exponentially decaying average $E[g^2]_t$, RMSprop prevents the denominator term $\sqrt{E[g^2]_t} + \epsilon$ from growing monotonically throughout training. It adapts to the recent magnitude of gradients for each parameter. If a parameter's gradients become small after initially being large, $E[g^2]_t$ will decrease over time, allowing the effective learning rate for that parameter to potentially increase again, unlike in AdaGrad where it would remain suppressed. This adaptability makes RMSprop significantly more effective for training deep neural networks over extended periods.
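Putting the accumulator and the parameter update together, a single RMSprop step for one parameter tensor could look like the following sketch (theta, grad, and avg_sq_grad are made-up names; real frameworks handle this bookkeeping internally):

import torch

lr, beta, eps = 0.001, 0.9, 1e-8        # learning rate, decay rate, stability constant
theta = torch.randn(3)                  # parameters being optimized
avg_sq_grad = torch.zeros_like(theta)   # running estimate of E[g^2]

def rmsprop_step(theta, grad, avg_sq_grad):
    # Update the exponentially decaying average of squared gradients
    avg_sq_grad = beta * avg_sq_grad + (1 - beta) * grad ** 2
    # Scale each parameter's step by the root of its recent squared gradients
    theta = theta - lr * grad / (avg_sq_grad.sqrt() + eps)
    return theta, avg_sq_grad

grad = torch.tensor([0.5, -1.0, 0.1])   # hypothetical gradient at the current step
theta, avg_sq_grad = rmsprop_step(theta, grad, avg_sq_grad)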
Consider how the denominators might evolve conceptually:
The AdaGrad denominator term tends to grow continuously as it sums all past squared gradients. The RMSprop denominator adapts based on a moving average, allowing it to decrease if recent gradients become smaller, preventing aggressive learning rate decay. (Note: this is a conceptual illustration assuming a gradient sequence that is large at first and then becomes small, as the simulation below shows.)
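The short simulation below (plain Python with a made-up gradient sequence, large at first and then small) tracks both denominator terms for a single parameter to make the contrast concrete:

import math

beta, eps = 0.9, 1e-8
grads = [1.0] * 10 + [0.01] * 10        # large gradients early, small gradients later

adagrad_sum = 0.0                       # AdaGrad: running sum of squared gradients
rms_avg = 0.0                           # RMSprop: moving average of squared gradients

for t, g in enumerate(grads, start=1):
    adagrad_sum += g ** 2
    rms_avg = beta * rms_avg + (1 - beta) * g ** 2
    print(f"step {t:2d}  AdaGrad denom: {math.sqrt(adagrad_sum) + eps:.4f}  "
          f"RMSprop denom: {math.sqrt(rms_avg) + eps:.4f}")

In the printed output, the AdaGrad denominator climbs past 3 during the large-gradient phase and stays there, while the RMSprop denominator starts shrinking as soon as the small gradients begin to dominate the moving average.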
RMSprop is a widely used optimizer available in all major deep learning frameworks. When using it, you typically specify the learning rate $\eta$, the decay factor $\beta$ (commonly named alpha or rho in libraries, with a default around 0.9 or 0.99), and the epsilon $\epsilon$. While it generally requires less tuning of the learning rate compared to SGD, finding good values for $\eta$ and $\beta$ can still influence performance.
import torch
import torch.optim as optim
import torch.nn as nn

# A small model and loss for demonstration
model = nn.Linear(10, 2)
criterion = nn.MSELoss()

# Initialize the RMSprop optimizer
# Common parameters: parameters to optimize, learning rate (lr), decay factor (alpha), epsilon (eps)
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, eps=1e-08)

# One iteration of a training loop with dummy data
inputs = torch.randn(32, 10)
targets = torch.randn(32, 2)

optimizer.zero_grad()                     # Reset gradients from the previous iteration
loss = criterion(model(inputs), targets)  # Forward pass and loss computation
loss.backward()                           # Compute gradients
optimizer.step()                          # Update parameters using RMSprop
RMSprop provides a robust and often effective alternative to AdaGrad, laying the groundwork for further improvements seen in optimizers like Adam, which we will discuss next.