While standard Stochastic Gradient Descent (SGD) and its Momentum variant update all parameters using the same learning rate (potentially scaled by momentum), this is not always optimal. Imagine a loss landscape that is very steep in one direction but quite flat in another. A single learning rate may cause oscillations in the steep direction or move too slowly in the flat direction. We need a way to adapt the learning rate individually for each parameter.
RMSprop (Root Mean Square Propagation) is an optimization algorithm designed to address this issue by maintaining a per-parameter learning rate. It achieves this by keeping track of a moving average of the squared gradients for each parameter. The core idea is to divide the learning rate for a specific weight by a running average of the magnitudes of recent gradients for that weight.
RMSprop modifies the gradient descent update rule. For each parameter (let's use a weight $w$ as an example), it calculates an exponentially decaying average of squared gradients. Let $S_{dw}$ be this moving average for weight $w$ at a given iteration. The update rule for $S_{dw}$ is:

$$S_{dw} = \beta S_{dw} + (1 - \beta)\left(\frac{\partial L}{\partial w}\right)^2$$

Here, $\beta$ is the decay rate that controls how quickly older squared gradients are forgotten (a typical value is 0.9 or 0.99), and $\frac{\partial L}{\partial w}$ is the current gradient of the loss with respect to $w$.
The parameter update rule then uses this moving average to scale the learning rate α:
$$w = w - \frac{\alpha}{\sqrt{S_{dw}} + \epsilon}\,\frac{\partial L}{\partial w}$$

Similarly, for a bias parameter $b$:
$$S_{db} = \beta S_{db} + (1 - \beta)\left(\frac{\partial L}{\partial b}\right)^2$$

$$b = b - \frac{\alpha}{\sqrt{S_{db}} + \epsilon}\,\frac{\partial L}{\partial b}$$

The term $\sqrt{S_{dw}}$ (or $\sqrt{S_{db}}$) in the denominator is the root mean square (RMS) of the recent gradients, which gives the algorithm its name. The small value $\epsilon$ (epsilon, e.g., $10^{-8}$) is added for numerical stability, preventing division by zero when $S_{dw}$ becomes extremely close to zero.
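To make these update rules concrete, here is a minimal sketch of a single RMSprop step written with plain NumPy. The variable names (w, grad_w, s_dw) and the hyperparameter values are illustrative choices for this sketch, not part of any library API.

import numpy as np

# Illustrative hyperparameters (assumed values, not prescriptive)
alpha = 0.001   # learning rate
beta = 0.9      # decay rate for the moving average of squared gradients
eps = 1e-8      # numerical stability constant

# Parameters and a running average of squared gradients, initialized to zero
w = np.array([0.5, -1.2, 2.0])
s_dw = np.zeros_like(w)

def rmsprop_step(w, grad_w, s_dw):
    """Apply one RMSprop update to w given its gradient grad_w."""
    # Update the exponentially decaying average of squared gradients
    s_dw = beta * s_dw + (1 - beta) * grad_w ** 2
    # Scale each parameter's step by the RMS of its recent gradients
    w = w - alpha * grad_w / (np.sqrt(s_dw) + eps)
    return w, s_dw

# Example: for the quadratic loss L = 0.5 * sum(w**2), the gradient is w itself
grad_w = w.copy()
w, s_dw = rmsprop_step(w, grad_w, s_dw)
print(w, s_dw)

In a real training loop, grad_w would come from backpropagation rather than a closed-form expression, but the per-parameter scaling is exactly the one in the formulas above.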
How does this help?
When a parameter consistently receives large gradients (a steep direction), its moving average $S_{dw}$ grows large, so dividing by $\sqrt{S_{dw}}$ shrinks the effective step and damps oscillations. When a parameter receives small gradients (a flat direction), $S_{dw}$ stays small and the effective step is comparatively larger, speeding up progress. Essentially, RMSprop automatically adjusts the step size for each parameter based on the historical magnitude of its gradients, as the short numeric sketch below illustrates.
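As a quick numerical illustration (the recurring gradient values 10.0 and 0.1 are hypothetical), compare a parameter that keeps seeing large gradients with one that keeps seeing small gradients. After a few steps their effective step sizes end up on a similar scale:

import numpy as np

alpha, beta, eps = 0.01, 0.9, 1e-8
s_steep, s_flat = 0.0, 0.0

for _ in range(20):
    g_steep, g_flat = 10.0, 0.1      # hypothetical recurring gradients
    s_steep = beta * s_steep + (1 - beta) * g_steep ** 2
    s_flat = beta * s_flat + (1 - beta) * g_flat ** 2

# Effective per-parameter step magnitudes
step_steep = alpha * g_steep / (np.sqrt(s_steep) + eps)
step_flat = alpha * g_flat / (np.sqrt(s_flat) + eps)
print(step_steep, step_flat)  # both close to alpha / sqrt(1 - beta**20) ≈ 0.0107

Despite a 100x difference in gradient magnitude, the two parameters take steps of roughly the same size, because each is normalized by the RMS of its own gradient history.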
Advantages: RMSprop adapts the step size for each parameter individually, which damps oscillations along steep directions of the loss surface and allows faster progress along flat ones, without manually tuning a separate learning rate per parameter.

Considerations: it introduces two additional hyperparameters, the decay rate β and the stability constant ε, and the global learning rate α still has to be chosen carefully. Standard RMSprop also lacks a momentum term, although many implementations (including PyTorch's) can optionally add one.
RMSprop was a significant step forward in optimization algorithms. It addressed the limitations of a single global learning rate by introducing per-parameter adaptation based on the magnitude of recent gradients.
Implementing RMSprop in a framework like PyTorch is straightforward. You simply select RMSprop from the torch.optim module when configuring your optimizer.
import torch
import torch.nn as nn
import torch.optim as optim

# Example: a simple linear model for demonstration
# (any model works, e.g. nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1)))
model = nn.Linear(10, 1)

# Random input data and target labels for the demonstration
inputs = torch.randn(64, 10)
labels = torch.randn(64, 1)
loss_fn = nn.MSELoss()

# Define hyperparameters
learning_rate = 0.001
# Note: PyTorch uses 'alpha' for the smoothing constant (beta in our notation)
beta_rms = 0.99
epsilon = 1e-8

# Instantiate the RMSprop optimizer
optimizer = optim.RMSprop(model.parameters(),
                          lr=learning_rate,
                          alpha=beta_rms,  # This is the decay rate beta
                          eps=epsilon,
                          momentum=0)      # Standard RMSprop has no momentum term here

# --- Example Training Step ---
optimizer.zero_grad()            # Clear previous gradients
outputs = model(inputs)          # Forward pass
loss = loss_fn(outputs, labels)  # Calculate loss
loss.backward()                  # Backpropagation
optimizer.step()                 # Update weights using RMSprop
# -----------------------------

print(f"Optimizer created: {optimizer}")
A simple example showing how to instantiate the RMSprop optimizer in PyTorch. Note that the parameter alpha in optim.RMSprop corresponds to the decay rate β discussed in the algorithm description. Standard RMSprop doesn't inherently include a momentum term, although PyTorch's implementation allows adding one (set to 0 here for the basic version).
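If you want to convince yourself of that correspondence, a small sketch like the following (using an arbitrary quadratic loss and assumed hyperparameter values) applies the hand-written update from earlier alongside optim.RMSprop and checks that the two parameter trajectories match:

import torch
import torch.optim as optim

lr, beta, eps = 0.01, 0.99, 1e-8

# Identical starting parameters for the manual update and the optimizer
w_manual = torch.tensor([1.0, -2.0, 0.5])
w_torch = w_manual.clone().requires_grad_(True)
s_dw = torch.zeros_like(w_manual)

opt = optim.RMSprop([w_torch], lr=lr, alpha=beta, eps=eps, momentum=0)

for _ in range(5):
    # Manual RMSprop step on the quadratic loss L = sum(w**2), so grad = 2 * w
    grad = 2 * w_manual
    s_dw = beta * s_dw + (1 - beta) * grad ** 2
    w_manual = w_manual - lr * grad / (torch.sqrt(s_dw) + eps)

    # Same step via torch.optim.RMSprop
    opt.zero_grad()
    loss = (w_torch ** 2).sum()
    loss.backward()
    opt.step()

print(torch.allclose(w_manual, w_torch.detach(), atol=1e-6))  # expected: True

In practice you would never hand-roll the update; the point is only that PyTorch's alpha plays the role of β from the formulas above.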
RMSprop provides an effective way to adapt learning rates during training. The next section introduces the Adam optimizer, which combines the adaptive learning rate approach of RMSprop with the momentum concept we saw earlier, creating one of the most widely used optimizers in deep learning today.