Okay, let's put theory into practice. Having reviewed the mechanics of Stochastic Gradient Descent (SGD), Momentum, and Nesterov Accelerated Gradient (NAG), the best way to solidify understanding is to observe their behavior empirically. In this hands-on section, we'll set up a simple optimization problem and use visualizations to analyze how these algorithms navigate the loss surface and approach a minimum. Our goal isn't just to run the code, but to interpret the results, connecting the observed convergence patterns back to the theoretical concepts discussed earlier in the chapter.
To compare optimizers effectively, we need a consistent testbed. Let's consider a standard problem such as minimizing a simple quadratic function or training a logistic regression model on a small, perhaps synthetic, dataset. The key is consistency: every optimizer should minimize the same objective, start from the same initial parameters, and run for the same number of iterations, so that any difference between the convergence curves can be attributed to the update rule alone.
We will track the loss value at each iteration (or after processing each batch) for each optimizer. Plotting these loss values against the iteration number provides a convergence curve.
Let's consider optimizing a simple quadratic function like $f(w_1, w_2) = (w_1 - 5)^2 + (w_2 - 3)^2$. While convex and simple, it helps illustrate the basic dynamics. We'll start all optimizers from a point like $(w_1, w_2) = (0, 0)$.
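As a concrete starting point, here is a minimal NumPy sketch of this objective and its gradient. The names `loss`, `grad`, and `w0` are our own conventions, reused in the snippets that follow:

```python
import numpy as np

def loss(w):
    """Quadratic objective f(w1, w2) = (w1 - 5)^2 + (w2 - 3)^2."""
    return (w[0] - 5.0) ** 2 + (w[1] - 3.0) ** 2

def grad(w):
    """Gradient of the quadratic: [2(w1 - 5), 2(w2 - 3)]."""
    return np.array([2.0 * (w[0] - 5.0), 2.0 * (w[1] - 3.0)])

w0 = np.array([0.0, 0.0])  # common starting point for every optimizer
```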
Assume we have functions to compute the loss and its gradient $\nabla f(w)$ for our chosen problem. The core update rules we'll compare are:

1. SGD: $w_{t+1} = w_t - \eta \nabla f(w_t)$, where $\eta$ is the learning rate.
2. Momentum: $v_{t+1} = \beta v_t + \eta \nabla f(w_t)$ and $w_{t+1} = w_t - v_{t+1}$, where $\beta$ is the momentum coefficient (e.g., 0.9) and $v_0 = 0$.
3. Nesterov Accelerated Gradient (NAG): $v_{t+1} = \beta v_t + \eta \nabla f(w_t - \beta v_t)$ and $w_{t+1} = w_t - v_{t+1}$, where the gradient is computed at a "lookahead" position $w_t - \beta v_t$.

(Note: Practical implementations often use slightly different but equivalent formulations, especially for NAG. We'll focus on the conceptual difference; a sketch of a single loop covering all three rules appears below.)
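Here is one way to express all three update rules in a single loop, assuming the `loss`, `grad`, and `w0` defined earlier; `run_optimizer` is our own helper name. Setting `beta=0` recovers plain SGD, and `nesterov=True` switches the gradient evaluation to the lookahead point:

```python
def run_optimizer(loss, grad, w0, eta=0.1, beta=0.0, nesterov=False, steps=100):
    """Run SGD (beta=0), Momentum, or NAG (nesterov=True); record the loss per step."""
    w = w0.astype(float).copy()
    v = np.zeros_like(w)              # v_0 = 0, as in the update rules above
    losses = []
    for _ in range(steps):
        # NAG evaluates the gradient at the lookahead point w - beta * v;
        # SGD and Momentum evaluate it at the current point w.
        g = grad(w - beta * v) if nesterov else grad(w)
        v = beta * v + eta * g        # velocity update (reduces to eta * g for SGD)
        w = w - v                     # parameter update
        losses.append(loss(w))
    return np.array(losses)
```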
After running each optimizer for a fixed number of iterations (say, 100), we plot the loss at each iteration.
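A sketch of the experiment and the log-scale plot, assuming matplotlib and the helpers above. The hyperparameter values ($\eta = 0.01$, $\beta = 0.9$) are illustrative choices, not tuned settings:

```python
import matplotlib.pyplot as plt

curves = {
    "SGD":      run_optimizer(loss, grad, w0, eta=0.01),
    "Momentum": run_optimizer(loss, grad, w0, eta=0.01, beta=0.9),
    "NAG":      run_optimizer(loss, grad, w0, eta=0.01, beta=0.9, nesterov=True),
}
for name, losses in curves.items():
    plt.semilogy(losses, label=name)  # log scale turns linear convergence into straight lines
plt.xlabel("Iteration")
plt.ylabel("Loss (log scale)")
plt.legend()
plt.show()
```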
Figure: Convergence curves for SGD, Momentum, and NAG on a simple quadratic objective, plotted with the loss on a log scale. Note the faster decrease achieved by Momentum and especially NAG.
Analysis: All three curves decrease, but at different rates. Plain SGD shrinks the error by a constant factor per step, which appears as a straight line on the log-scale plot. Momentum builds velocity along the consistent downhill direction and descends noticeably faster, at the cost of some overshoot and mild oscillation near the minimum. NAG's lookahead gradient corrects the velocity before the overshoot occurs, which is why it typically produces the fastest and smoothest curve of the three.
The learning rate ($\eta$) is a critical hyperparameter. Let's visualize its effect on a single optimizer, like SGD.
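A sketch of such a sweep for plain SGD, reusing the helpers above. The specific values are chosen to illustrate the different regimes: on this quadratic, the error is multiplied by $|1 - 2\eta|$ each step, so any $\eta > 1$ diverges:

```python
# Sweep learning rates for plain SGD. The error vector is scaled by
# |1 - 2*eta| per step on this quadratic, so eta = 1.05 blows up.
for eta in [0.01, 0.1, 0.45, 1.05]:
    plt.semilogy(run_optimizer(loss, grad, w0, eta=eta), label=f"eta = {eta}")
plt.xlabel("Iteration")
plt.ylabel("Loss (log scale)")
plt.legend()
plt.show()
```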
Figure: Comparing SGD convergence with different learning rates. A low rate leads to slow progress, a high rate causes divergence, and an appropriate rate achieves fast convergence.
Analysis: With a learning rate that is too low, every step makes progress but the slope of the log-loss curve is shallow, so convergence is painfully slow. An intermediate rate converges quickly and smoothly. As the rate approaches the stability limit, the iterates overshoot the minimum and the curve begins to oscillate; beyond it (on this quadratic, $\eta > 1$, since the error is scaled by $|1 - 2\eta|$ per step) the loss grows without bound.
Similar experiments can be run varying the momentum coefficient ($\beta$) for Momentum and NAG to observe how it influences the smoothing and acceleration effects.
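A sketch of the analogous sweep over $\beta$ at a fixed learning rate; again the values are illustrative. Larger $\beta$ accelerates early progress, but very large values can overshoot and oscillate enough to slow final convergence:

```python
# Sweep the momentum coefficient at a fixed learning rate.
for beta in [0.0, 0.5, 0.9, 0.99]:
    plt.semilogy(run_optimizer(loss, grad, w0, eta=0.01, beta=beta),
                 label=f"beta = {beta}")
plt.xlabel("Iteration")
plt.ylabel("Loss (log scale)")
plt.legend()
plt.show()
```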
While the quadratic example is instructive, real-world machine learning problems, especially in deep learning, involve highly non-convex loss surfaces. Running these optimizers on a non-convex function (even a simple 2D one like the Rastrigin function) can reveal different behaviors: plain SGD tends to stall in the first local minimum it reaches, while Momentum and NAG can carry enough velocity to coast through shallow basins, though that same velocity can also carry them past a good minimum. Analyzing convergence plots in these scenarios often shows periods of rapid decrease followed by stagnation (a local minimum or a plateau) or more erratic behavior; a sketch for experimenting with the Rastrigin function follows.
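For readers who want to try this, here is a sketch of the 2D Rastrigin objective and its analytic gradient, reusing `run_optimizer`. The starting point and hyperparameters are arbitrary choices for illustration:

```python
A = 10.0  # standard Rastrigin amplitude

def rastrigin(w):
    """2D Rastrigin: 2A + sum(w_i^2 - A*cos(2*pi*w_i)); global minimum 0 at the origin."""
    return 2 * A + np.sum(w**2 - A * np.cos(2 * np.pi * w))

def rastrigin_grad(w):
    """Analytic gradient: 2*w_i + 2*pi*A*sin(2*pi*w_i)."""
    return 2 * w + 2 * np.pi * A * np.sin(2 * np.pi * w)

w0_hard = np.array([3.3, -4.2])  # arbitrary non-trivial start
curve = run_optimizer(rastrigin, rastrigin_grad, w0_hard,
                      eta=0.002, beta=0.9, steps=300)
plt.semilogy(curve)
plt.xlabel("Iteration")
plt.ylabel("Loss (log scale)")
plt.show()
```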
When analyzing convergence plots for foundational optimizers, a few habits pay off:

1. Plot the loss on a log scale, so differences in convergence rate show up as differences in slope.
2. Watch for oscillation in the curve, which usually signals a learning rate near the stability limit.
3. Treat long flat stretches as a plateau or local minimum rather than as successful convergence.
These practical observations form the basis for understanding why more advanced optimization techniques were developed. The limitations seen here, such as sensitivity to learning rates, slow convergence in certain landscapes, and issues with non-convexity, motivate the adaptive learning rate methods (Chapter 3) and second-order methods (Chapter 2) we will explore next.