Stochastic Gradient Descent (SGD), Momentum, and Nesterov Accelerated Gradient (NAG) are the foundational optimization algorithms, and observing their behavior empirically goes a long way toward understanding their mechanics. A hands-on approach involves setting up a simple optimization problem and using visualizations to analyze how these algorithms navigate the loss surface and approach a minimum. The objective is not merely to execute code, but to interpret the results, linking observed convergence patterns to relevant theoretical concepts.
To compare optimizers effectively, we need a consistent testbed. Let's consider a standard problem like minimizing a simple quadratic function or training a logistic regression model on a small, perhaps synthetic, dataset. The point is consistency: every optimizer should start from the same initial weights, see the same data, and run for the same number of iterations, so that differences in the convergence curves reflect the algorithms themselves.
We will track the loss value at each iteration (or after processing each batch) for each optimizer. Plotting these loss values against the iteration number provides a convergence curve.
Let's consider optimizing a simple quadratic function like $f(w_1, w_2) = (w_1 - 5)^2 + (w_2 - 3)^2$. While convex and simple, it helps illustrate the basic dynamics. We'll start all optimizers from the same point, $(w_1, w_2) = (0, 0)$.
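In code, a minimal sketch of this setup might look like the following (the names `loss`, `grad`, and `w0` are illustrative choices, not fixed by the text):

```python
import numpy as np

def loss(w):
    """Quadratic objective f(w1, w2) = (w1 - 5)^2 + (w2 - 3)^2."""
    return (w[0] - 5.0) ** 2 + (w[1] - 3.0) ** 2

def grad(w):
    """Analytic gradient: [2(w1 - 5), 2(w2 - 3)]."""
    return np.array([2.0 * (w[0] - 5.0), 2.0 * (w[1] - 3.0)])

w0 = np.array([0.0, 0.0])  # shared starting point for all optimizers
```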
Assume we have functions to compute the loss and its gradient $\nabla f(w)$ for our chosen problem. The core update rules we'll compare are:
1. SGD: $w_{t+1} = w_t - \eta \nabla f(w_t)$, where $\eta$ is the learning rate.
2. Momentum: $v_{t+1} = \beta v_t + \eta \nabla f(w_t)$, $w_{t+1} = w_t - v_{t+1}$, where $\beta$ is the momentum coefficient (e.g., 0.9) and $v_0 = 0$.
3. Nesterov Accelerated Gradient (NAG): $v_{t+1} = \beta v_t + \eta \nabla f(w_t - \beta v_t)$, $w_{t+1} = w_t - v_{t+1}$, where the gradient is computed at a "lookahead" position $w_t - \beta v_t$.
(Note: practical implementations often use slightly different but equivalent formulations, especially for NAG. We'll focus on the conceptual difference: where the gradient is evaluated. A direct translation of the rules above into code is sketched below.)
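This sketch assumes the `loss`, `grad`, and `w0` definitions from earlier; the helper names and default hyperparameters are illustrative:

```python
def run_optimizer(step, w0, steps=100):
    """Apply an update rule repeatedly, recording the loss at every iteration."""
    w, v = w0.copy(), np.zeros_like(w0)
    history = []
    for _ in range(steps):
        w, v = step(w, v)
        history.append(loss(w))
    return history

def sgd_step(w, v, lr=0.02):
    # Plain SGD: step along the negative gradient; the velocity is unused.
    return w - lr * grad(w), v

def momentum_step(w, v, lr=0.02, beta=0.9):
    v = beta * v + lr * grad(w)  # accumulate an exponentially decaying velocity
    return w - v, v

def nag_step(w, v, lr=0.02, beta=0.9):
    v = beta * v + lr * grad(w - beta * v)  # gradient at the "lookahead" point
    return w - v, v
```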
After running each optimizer for a fixed number of iterations (say, 100), we plot the loss at each iteration.
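One way to produce the comparison, assuming the sketch above and matplotlib for plotting:

```python
import matplotlib.pyplot as plt

histories = {
    "SGD": run_optimizer(sgd_step, w0),
    "Momentum": run_optimizer(momentum_step, w0),
    "NAG": run_optimizer(nag_step, w0),
}

for name, history in histories.items():
    plt.plot(history, label=name)
plt.yscale("log")  # a log scale makes different convergence rates easy to compare
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()
```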
Convergence curves for SGD, Momentum, and NAG on a simple quadratic objective, plotted on a log scale for the loss. Note the faster decrease achieved by Momentum and especially NAG.
Analysis: Momentum's accumulated velocity lets it take larger effective steps in the consistently downhill direction, so its loss falls faster than plain SGD's. NAG, by evaluating the gradient at the lookahead point, corrects its velocity before overshooting, which typically yields the fastest and smoothest descent of the three.
The learning rate ($\eta$) is a critical hyperparameter. Let's visualize its effect on a single optimizer, like SGD.
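A sketch of such a sweep, reusing the helpers above (the specific rates are illustrative; on this quadratic, whose gradient carries a factor of 2, plain SGD diverges once $\eta$ exceeds 1):

```python
for lr in (0.001, 0.1, 1.05):  # too low, about right, divergent
    history = run_optimizer(lambda w, v: sgd_step(w, v, lr=lr), w0)
    plt.plot(history, label=f"lr = {lr}")
plt.yscale("log")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()
```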
Comparing SGD convergence with different learning rates. A low rate leads to slow progress, a high rate causes divergence, and an appropriate rate achieves fast convergence.
Analysis: with too low a rate, the loss decreases but very slowly; with too high a rate, each step overshoots the minimum by more than it gains and the loss grows without bound; a well-chosen rate in between converges rapidly.
Similar experiments can be run varying the momentum coefficient ($\beta$) for Momentum and NAG to observe how it influences the smoothing and acceleration effects.
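For example, a quick sweep over $\beta$ for Momentum might look like:

```python
for beta in (0.5, 0.9, 0.99):  # illustrative momentum coefficients
    history = run_optimizer(lambda w, v: momentum_step(w, v, beta=beta), w0)
    plt.plot(history, label=f"beta = {beta}")
plt.yscale("log")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()
```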
"While the quadratic example is instructive, machine learning problems, especially in deep learning, involve highly non-convex loss surfaces. Running these optimizers on a non-convex function (even a simple 2D one like the Rastrigin function) can reveal different behaviors:"
Analyzing convergence plots in these scenarios often shows periods of rapid decrease followed by stagnation (local minimum or plateau) or more erratic behavior.
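To try this yourself, only the objective and gradient need to change. Here is a sketch for the 2-D Rastrigin function (the starting point and learning rate are illustrative choices):

```python
def rastrigin(w):
    """2-D Rastrigin: a grid of local minima surrounding the global minimum at the origin."""
    return 20.0 + np.sum(w ** 2 - 10.0 * np.cos(2.0 * np.pi * w))

def rastrigin_grad(w):
    return 2.0 * w + 20.0 * np.pi * np.sin(2.0 * np.pi * w)

# Plain gradient descent on Rastrigin: from this start it typically settles
# into a nearby local minimum instead of reaching the origin.
w = np.array([3.0, 2.5])
for _ in range(100):
    w = w - 0.002 * rastrigin_grad(w)
print(w, rastrigin(w))
```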
When analyzing convergence plots for foundational optimizers, look at the slope of the curve (convergence speed), oscillations (often a sign the learning rate or momentum coefficient is too high), plateaus (slow progress through flat regions or local minima), and upward trends (divergence).
These practical observations form the basis for understanding why more advanced optimization techniques were developed. The limitations seen here, such as sensitivity to learning rates, slow convergence in certain landscapes, and issues with non-convexity, motivate the adaptive learning rate methods (Chapter 3) and second-order methods (Chapter 2) we will explore next.