Stochastic Gradient Descent (SGD), Momentum, and Nesterov Accelerated Gradient (NAG) are foundational optimization algorithms, and observing their behavior empirically is one of the best ways to understand their mechanics. A hands-on approach involves setting up a simple optimization problem and using visualizations to analyze how these algorithms navigate the loss surface and approach a minimum. The objective is not merely to execute code, but to interpret the results, linking the observed convergence patterns to the relevant theoretical concepts.

## Setting Up a Controlled Experiment

To compare optimizers effectively, we need a consistent testbed. A standard choice is minimizing a simple quadratic function or training a logistic regression model on a small, perhaps synthetic, dataset. The key requirement is consistency:

- **Objective function:** Use the exact same loss function for all optimizers.
- **Data:** Use the exact same dataset (and batching strategy, if applicable) for each run.
- **Initialization:** Start all optimizers from the exact same initial parameter values $w_0$.
- **Iterations:** Run each optimizer for the same number of iterations or epochs.

We track the loss value at each iteration (or after processing each batch) for each optimizer. Plotting these loss values against the iteration number produces a convergence curve.

Let's optimize a simple quadratic function, $f(w_1, w_2) = (w_1 - 5)^2 + (w_2 - 3)^2$. While convex and simple, it illustrates the basic dynamics well. We start all optimizers from the same point, $(w_1, w_2) = (0, 0)$.

## Implementing the Core Logic

Assume we have functions that compute the loss and its gradient $\nabla f(w)$ for our chosen problem. The core update rules we compare are:

1. **SGD:**
$$w_{t+1} = w_t - \eta \nabla f(w_t)$$
where $\eta$ is the learning rate.

2. **Momentum:**
$$v_{t+1} = \beta v_t + \eta \nabla f(w_t)$$
$$w_{t+1} = w_t - v_{t+1}$$
where $\beta$ is the momentum coefficient (e.g., 0.9) and $v_0 = 0$.

3. **Nesterov Accelerated Gradient (NAG):**
$$v_{t+1} = \beta v_t + \eta \nabla f(w_t - \beta v_t)$$
$$w_{t+1} = w_t - v_{t+1}$$
where the gradient is computed at a "lookahead" position $w_t - \beta v_t$.

(Note: practical implementations often use slightly different but equivalent formulations, especially for NAG. We focus on the conceptual difference between the update rules.)
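The sketch below shows one way these three update rules might be implemented for the quadratic example; the helper names (`loss`, `grad`, `run_optimizer`) and the hyperparameter defaults are illustrative assumptions rather than part of any particular library.

```python
import numpy as np

def loss(w):
    # f(w1, w2) = (w1 - 5)^2 + (w2 - 3)^2
    return (w[0] - 5.0) ** 2 + (w[1] - 3.0) ** 2

def grad(w):
    # Analytic gradient of the quadratic objective
    return np.array([2.0 * (w[0] - 5.0), 2.0 * (w[1] - 3.0)])

def run_optimizer(method, w0, lr=0.1, beta=0.9, n_iters=100):
    """Return the loss history for SGD, Momentum, or NAG."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)  # velocity, used by Momentum and NAG
    history = [loss(w)]
    for _ in range(n_iters):
        if method == "sgd":
            w = w - lr * grad(w)
        elif method == "momentum":
            v = beta * v + lr * grad(w)
            w = w - v
        elif method == "nag":
            # Gradient evaluated at the lookahead position w - beta * v
            v = beta * v + lr * grad(w - beta * v)
            w = w - v
        else:
            raise ValueError(f"unknown method: {method}")
        history.append(loss(w))
    return history

# Run all three optimizers from the same starting point (0, 0).
w0 = np.array([0.0, 0.0])
histories = {m: run_optimizer(m, w0) for m in ["sgd", "momentum", "nag"]}
```

Plotting each returned history against the iteration number, ideally with the loss on a log scale, produces convergence curves like those discussed next.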
## Visualizing and Analyzing Convergence Curves

After running each optimizer for a fixed number of iterations (say, 100), we plot the loss at each iteration.

[Figure: Optimizer Convergence Comparison (Quadratic Function). Loss (log scale) versus iteration for SGD (lr=0.1), Momentum (lr=0.1, beta=0.9), and NAG (lr=0.1, beta=0.9).]

Convergence curves for SGD, Momentum, and NAG on a simple quadratic objective, plotted on a log scale for the loss. Note the faster decrease achieved by Momentum and especially NAG.

**Analysis:**

- **SGD:** Observe the steady decrease in loss. On this simple convex problem it reliably moves toward the minimum, though its rate of decrease is slower than that of the other two methods.
- **Momentum:** The loss typically decreases faster than with SGD, especially after the initial steps, because the momentum term accelerates movement along consistent descent directions. You may sometimes see slight overshooting or oscillation, but it generally reaches low loss values sooner.
- **NAG:** NAG often shows the fastest initial convergence. The lookahead gradient calculation lets it anticipate changes and adjust its trajectory more effectively, which frequently yields quicker convergence and dampens oscillations compared to standard Momentum.

## The Impact of Learning Rate

The learning rate $\eta$ is a critical hyperparameter. Let's visualize its effect on a single optimizer, such as SGD.

[Figure: Effect of Learning Rate on SGD Convergence. Loss (log scale) versus iteration for lr=0.01 (too low), lr=0.1 (good), and lr=0.5 (too high).]

Comparing SGD convergence with different learning rates: a low rate leads to slow progress, a high rate causes divergence, and an appropriate rate achieves fast convergence.

**Analysis:**

- **Too low ($\eta = 0.01$):** Convergence is very slow. The steps are too small to reach the minimum efficiently within the given number of iterations.
- **Good ($\eta = 0.1$):** The loss decreases rapidly and effectively reaches the minimum.
- **Too high ($\eta = 0.5$):** The optimizer diverges. The steps are too large, overshooting the minimum and causing the loss to increase dramatically. This highlights the sensitivity of convergence to the learning rate.

Similar experiments can be run varying the momentum coefficient $\beta$ for Momentum and NAG to observe how it influences the smoothing and acceleration effects.

## Exploring Non-Convex Scenarios

While the quadratic example is instructive, machine learning problems, especially in deep learning, involve highly non-convex loss surfaces. Running these optimizers on a non-convex function (even a simple 2D one like the Rastrigin function) can reveal different behaviors; a sketch of such a setup appears after the list below.

- **Local minima:** All of these first-order methods can get stuck in local minima. The final converged loss may differ depending on the initialization and the optimizer's path.
- **Saddle points:** Momentum and NAG may be slightly better at escaping saddle points than plain SGD, as the momentum term can carry them through flat regions. However, they are not immune.
- **Plateaus:** Performance on plateaus (regions where the gradient is very small but not zero) can also vary; Momentum may help traverse plateaus faster than SGD.
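As a rough sketch of such a non-convex experiment, the snippet below defines the 2D Rastrigin function, its analytic gradient, and a short Momentum run from two different starting points. The function names, starting points, and hyperparameters are illustrative assumptions; the same pattern applies to SGD and NAG.

```python
import numpy as np

def rastrigin(w):
    # 2D Rastrigin: many local minima, global minimum of 0 at w = (0, 0)
    return 10 * len(w) + np.sum(w ** 2 - 10 * np.cos(2 * np.pi * w))

def rastrigin_grad(w):
    # Analytic gradient of the Rastrigin function
    return 2 * w + 20 * np.pi * np.sin(2 * np.pi * w)

def momentum_run(w0, lr=0.01, beta=0.9, n_iters=200):
    """Momentum updates on the Rastrigin surface; returns the loss history."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    history = [rastrigin(w)]
    for _ in range(n_iters):
        v = beta * v + lr * rastrigin_grad(w)
        w = w - v
        history.append(rastrigin(w))
    return history

# Different starting points can end up in different local minima.
for start in [np.array([3.0, 2.5]), np.array([-1.2, 4.3])]:
    final_loss = momentum_run(start)[-1]
    print(f"start={start}, final loss={final_loss:.4f}")
```

Comparing the final losses reached from different starting points (and with different optimizers) makes the dependence on initialization described above concrete.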
Analyzing convergence plots in these non-convex scenarios often shows periods of rapid decrease followed by stagnation (a local minimum or plateau) or more erratic behavior.

## Takeaways for Analysis

When analyzing convergence plots for foundational optimizers:

- **Rate of decrease:** Look at the slope of the loss curve (especially on a log scale). Steeper slopes indicate faster convergence. Compare how quickly different optimizers reduce the loss.
- **Stability:** Does the loss decrease monotonically, or does it oscillate? Oscillations can indicate a learning rate that is slightly too high or more complex interactions. Extreme oscillations or an increasing loss signal divergence.
- **Final loss value:** Where does the loss level off? For convex problems, this should be near the global minimum. For non-convex problems, compare the final loss values achieved by different optimizers or different runs; they may converge to different points.
- **Impact of hyperparameters:** Systematically vary learning rates and momentum coefficients to understand their effect on the speed and stability of convergence.

These practical observations form the basis for understanding why more advanced optimization techniques were developed. The limitations seen here, such as sensitivity to learning rates, slow convergence in certain landscapes, and issues with non-convexity, motivate the adaptive learning rate methods (Chapter 3) and second-order methods (Chapter 2) we will explore next.