As we saw, standard Gradient Descent updates weights based solely on the gradient calculated at the current position. While this works, it can lead to inefficient training, especially in certain types of loss landscapes. Imagine a narrow ravine or valley in the loss surface. Gradient Descent tends to oscillate back and forth across the steep walls of the ravine, making slow progress along the bottom towards the minimum. Similarly, it can get stuck in local minima or slow down considerably on saddle points.
To address these issues, we can introduce the concept of momentum, borrowing intuition from physics. Imagine a ball rolling down a hill. Instead of just calculating the steepest slope at its current location and moving slightly in that direction (like standard GD), the ball has momentum. Its current velocity influences its movement in the next step. If it's already moving quickly in a certain direction, it tends to keep going that way, even if the immediate slope changes slightly. This momentum helps it dampen oscillations across the steep walls of a ravine, accelerate along directions where the slope is consistent, and roll through small local dips and flat saddle regions instead of stalling.
Gradient Descent with Momentum applies this physical analogy to the optimization process. It introduces a "velocity" vector, v, which accumulates an exponentially decaying moving average of past gradients. The weight update then considers both the current gradient and this velocity.
The update rule for Gradient Descent with Momentum involves two steps at each iteration $t$:
Update the velocity vector $v_t$:

$$v_t = \beta v_{t-1} + \eta \nabla L(w_{t-1})$$

Here, $\beta$ is the momentum coefficient, $\eta$ is the learning rate, and $\nabla L(w_{t-1})$ is the gradient of the loss evaluated at the current weights $w_{t-1}$.

This step essentially calculates the new velocity by taking a fraction ($\beta$) of the old velocity and adding the scaled current gradient ($\eta \nabla L(w_{t-1})$). If the current gradient points consistently in the same direction as the previous velocity, the velocity magnitude increases. If the gradient direction oscillates, the velocity tends to be dampened because opposing gradient terms partially cancel out over time.
Update the weights $w_t$:

$$w_t = w_{t-1} - v_t$$

The weights are updated by moving in the direction of the newly computed velocity vector $v_t$.
By incorporating the velocity $v_t$, which holds information about recent gradient history, the updates become smoother and faster, particularly along directions where the gradient is consistent.
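To make these two steps concrete, here is a minimal from-scratch sketch in plain Python. The one-dimensional loss $L(w) = w^2$, the starting point, and the hyperparameter values are illustrative assumptions, not part of the original example.

def grad_L(w):
    # Gradient of the illustrative loss L(w) = w**2
    return 2.0 * w

eta = 0.1     # learning rate
beta = 0.9    # momentum coefficient

w = 5.0       # starting weight
v = 0.0       # velocity starts at zero

for t in range(200):
    g = grad_L(w)
    v = beta * v + eta * g    # step 1: v_t = beta * v_{t-1} + eta * grad
    w = w - v                 # step 2: w_t = w_{t-1} - v_t

print(w)      # ends up very close to the minimum at w = 0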
Consider optimizing a function with a narrow valley. Standard Gradient Descent might bounce between the walls, while Momentum takes a more direct route.
Optimization paths on a sample loss surface. Standard Gradient Descent (pink) oscillates across the narrow valley, making slow progress towards the minimum (near origin). Momentum (blue) dampens these oscillations and accelerates along the valley floor, taking a more direct path.
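To mirror the behavior in the figure, the following sketch runs both methods on an assumed elongated quadratic valley, $L(w) = \frac{1}{2}(100 w_1^2 + w_2^2)$, which is steep in $w_1$ and shallow along $w_2$. The loss, starting point, step count, and hyperparameters are illustrative choices, not values from the original example.

import numpy as np

def grad_L(w):
    # Gradient of the assumed valley-shaped loss 0.5 * (100 * w1**2 + w2**2)
    return np.array([100.0 * w[0], 1.0 * w[1]])

eta, beta, steps = 0.015, 0.9, 150

w_gd = np.array([1.0, 10.0])      # standard gradient descent iterate
w_mom = np.array([1.0, 10.0])     # momentum iterate, same starting point
v = np.zeros(2)                   # velocity for the momentum run

for _ in range(steps):
    w_gd = w_gd - eta * grad_L(w_gd)        # standard GD update (oscillates in the steep w1 direction)
    v = beta * v + eta * grad_L(w_mom)      # momentum: velocity update
    w_mom = w_mom - v                       # momentum: weight update

print("GD distance to minimum:      ", np.linalg.norm(w_gd))
print("Momentum distance to minimum:", np.linalg.norm(w_mom))

With these settings, the plain gradient descent iterate is still far from the minimum along the shallow $w_2$ direction after 150 steps, while the momentum iterate ends up much closer, matching the qualitative picture above.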
The momentum coefficient $\beta$ controls the influence of past gradients. Common values for $\beta$ are 0.9 or even higher, like 0.99. It often works well to start with 0.9 and potentially increase it during training. This parameter, like the learning rate, may require tuning for optimal performance on a specific problem.
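One way to build intuition for $\beta$: if the gradient were constant at some value $g$, the velocity would accumulate geometrically toward the limit $\eta g / (1 - \beta)$, so $\beta = 0.9$ permits an effective step up to roughly 10 times larger than a single gradient step, and $\beta = 0.99$ up to roughly 100 times larger. The short sketch below, with arbitrary illustrative values for $g$ and $\eta$, shows the accumulated velocity approaching that limit.

g, eta = 1.0, 0.1     # assumed constant gradient and learning rate

for beta in (0.5, 0.9, 0.99):
    v = 0.0
    for _ in range(2000):
        v = beta * v + eta * g            # repeated velocity update with a constant gradient
    print(beta, v, eta * g / (1 - beta))  # accumulated velocity vs. its theoretical limit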
Using momentum in frameworks like PyTorch is straightforward. When defining the optimizer, you simply specify the momentum parameter.
import torch
import torch.nn as nn
import torch.optim as optim
# A small placeholder model and learning rate for illustration;
# substitute your own nn.Module and hyperparameters here.
model = nn.Linear(10, 1)
learning_rate = 0.01

# Define the optimizer using SGD with momentum
momentum_coefficient = 0.9
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum_coefficient)

# --- Inside your training loop ---
inputs = torch.randn(32, 10)                 # dummy batch of inputs
targets = torch.randn(32, 1)                 # dummy targets
loss = nn.MSELoss()(model(inputs), targets)  # forward pass and loss

loss.backward()        # Compute gradients
optimizer.step()       # Update weights using SGD with momentum
optimizer.zero_grad()  # Reset gradients for the next iteration
By simply adding the momentum argument to the optim.SGD constructor, the optimizer automatically implements the velocity calculation and weight update steps described earlier.
Gradient Descent with Momentum is a significant improvement over standard gradient descent, often leading to faster convergence and better navigation of complex loss surfaces. While it helps overcome many challenges, further refinements led to the development of adaptive methods like RMSprop and Adam, which we will explore next.