While Stochastic Gradient Descent (SGD) offers computational advantages over batch gradient descent, its updates can be noisy, and its convergence can be slow, especially when navigating elongated valleys or ravines in the loss landscape. Imagine the loss surface as a hilly terrain. Vanilla SGD takes steps based only on the gradient at the current point, which can lead to erratic zig-zagging if the slope changes rapidly in different directions. This is particularly problematic in narrow ravines where the gradient points steeply across the valley but only gently along its bottom. SGD might bounce back and forth across the narrow axis, making very slow progress towards the minimum along the valley floor.
To address this, we can incorporate the concept of momentum, drawing inspiration from physics. Think of a ball rolling down a hill. It doesn't just move based on the slope at its current position; it also has momentum built up from its previous movement. This momentum helps it smooth out its path, roll over small bumps, and accelerate faster down consistent slopes.
SGD with Momentum applies a similar idea to parameter updates. Instead of using only the current gradient to update the parameters, we maintain a "velocity" vector, which is essentially an exponentially decaying moving average of past gradients. This velocity term represents the accumulated "momentum" of the parameter updates.
At each step t, we update the velocity v_t and then use this velocity to update the parameters θ_t. Following the convention used by PyTorch (some texts instead fold the learning rate or a (1 − β) factor into the velocity), the process looks like this:

v_t = β * v_(t-1) + ∇L(θ_(t-1))
θ_t = θ_(t-1) − η * v_t

Here:
- v_t is the velocity vector at step t, initialized to zero.
- β is the momentum coefficient, controlling how strongly past gradients persist.
- ∇L(θ_(t-1)) is the gradient of the loss with respect to the current parameters.
- η is the learning rate, scaling the size of each parameter update.
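As a concrete sketch, the two update equations can be written in a few lines of plain Python for a single scalar parameter (using the convention above, where the learning rate scales the velocity):

```python
def momentum_step(theta, velocity, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update for a single scalar parameter.

    The velocity is an exponentially decaying sum of past gradients;
    the parameter then moves opposite the accumulated velocity.
    """
    velocity = beta * velocity + grad   # v_t = beta * v_(t-1) + gradient
    theta = theta - lr * velocity       # theta_t = theta_(t-1) - lr * v_t
    return theta, velocity

# Two updates with a constant gradient of 1.0: the second step is larger
# because the velocity has started to accumulate.
theta, v = momentum_step(0.0, 0.0, 1.0, lr=0.1, beta=0.9)  # step of 0.1
theta, v = momentum_step(theta, v, 1.0, lr=0.1, beta=0.9)  # step of 0.19
```

Note how, under a consistent gradient, each step grows: this is the "acceleration down a consistent slope" from the rolling-ball analogy.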
Incorporating momentum brings several advantages:
- Dampened oscillations: gradient components that flip sign from step to step (across the ravine) largely cancel out in the velocity, reducing zig-zagging.
- Acceleration along consistent directions: components that keep the same sign (along the valley floor) accumulate, so the optimizer moves faster towards the minimum.
- Smoother updates: because the velocity averages over recent mini-batch gradients, the noise of any single gradient estimate has less influence.
- Progress through plateaus and small bumps: accumulated velocity helps the optimizer keep moving even where the current gradient is small.
Consider the difference in paths taken by SGD and SGD with Momentum in a typical loss landscape ravine:
Comparison of optimization paths for SGD (red) and SGD with Momentum (blue) on a hypothetical loss surface. Momentum takes a more direct path towards the minimum, avoiding the oscillations seen with standard SGD.
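This behavior is easy to reproduce numerically. The sketch below is a toy experiment (not the data behind the figure): it runs both optimizers on a hypothetical ravine-shaped quadratic loss, shallow along x and steep along y, and compares their final distance from the minimum at the origin.

```python
def grad(x, y):
    # Gradient of L(x, y) = 0.5 * (0.05 * x**2 + 4.0 * y**2):
    # shallow along x (the valley floor), steep along y (the valley walls).
    return 0.05 * x, 4.0 * y

def run(lr=0.4, beta=0.0, steps=100):
    x, y = 10.0, 1.0        # start away from the minimum at (0, 0)
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx  # with beta = 0 this reduces to vanilla SGD
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return (x**2 + y**2) ** 0.5  # distance from the minimum

plain = run(beta=0.0)   # vanilla SGD: bounces across y, crawls along x
heavy = run(beta=0.9)   # momentum: damped oscillations, faster along x
```

With these (illustrative) settings, vanilla SGD is still over a unit away from the minimum after 100 steps, while the momentum run ends much closer, because the small but consistent gradient along the valley floor has compounded in the velocity.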
The momentum coefficient β determines the "memory" of the optimizer. A value close to 1 (such as 0.99) gives past gradients long-lasting influence, producing smoother updates that react more slowly to changes in the gradient; a smaller value makes the optimizer rely more on the current gradient. With β = 0, the method reduces to vanilla SGD.
Common starting values for β are 0.9 or sometimes 0.99. Like the learning rate η, the optimal value for β is problem-dependent and often needs to be tuned through experimentation. It's important to remember that β and η interact, so adjusting one may require adjusting the other.
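One way to build intuition for β: if the gradient were held constant at g, the recurrence v = β·v + g converges to the geometric-series limit g / (1 − β). So β = 0.9 lets the velocity grow to roughly 10 times a single gradient, and β = 0.99 to roughly 100 times, which is one reason β and η must be tuned together. A small sketch:

```python
def terminal_velocity(beta, grad=1.0, steps=1000):
    # Apply a constant gradient repeatedly; the velocity saturates at
    # grad / (1 - beta), the limit of the geometric series it accumulates.
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad
    return v

v90 = terminal_velocity(0.9)    # approaches 1 / (1 - 0.9)  = 10
v99 = terminal_velocity(0.99)   # approaches 1 / (1 - 0.99) = 100
```

This also explains why increasing β without lowering η can cause divergence: the effective step size along consistent directions grows by a factor of 1 / (1 − β).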
Most deep learning frameworks provide straightforward implementations of SGD with momentum. For example, in PyTorch, you can enable momentum by simply specifying the momentum argument when creating an SGD optimizer instance:
import torch.optim as optim

# Assume 'model' is your network, an nn.Module instance
# Set learning rate and momentum coefficient
learning_rate = 0.01
momentum_beta = 0.9

optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum_beta)

# --- Inside your training loop ---
# optimizer.zero_grad()              # clear gradients from the previous step
# outputs = model(inputs)            # forward pass
# loss = criterion(outputs, labels)  # calculate loss
# loss.backward()                    # backward pass: compute gradients
# optimizer.step()                   # update weights using the velocity
By accumulating a velocity based on past gradients, SGD with Momentum offers a significant improvement over vanilla SGD, often leading to faster convergence and more stable training, particularly for the complex, non-convex loss landscapes encountered in deep learning. It forms the basis for many subsequent, more advanced optimization algorithms.
© 2025 ApX Machine Learning