First-order optimization methods are the workhorses for training many machine learning models, especially deep neural networks. While this course focuses on advanced techniques, a solid understanding of the foundational gradient descent variants is necessary. These algorithms iteratively adjust model parameters to minimize a loss function, relying primarily on the gradient, which points in the direction of steepest ascent; the update therefore moves the parameters in the opposite direction.
The most fundamental algorithm is Gradient Descent (often called Batch Gradient Descent or BGD). It computes the gradient of the loss function J(θ) with respect to the parameters θ using the entire training dataset. The parameter update rule at iteration t is:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t)$$

Here, $\eta$ is the learning rate, a hyperparameter controlling the step size. With an appropriately chosen learning rate, GD converges to the global minimum for convex functions and to a stationary point (in practice, usually a local minimum) for non-convex functions. However, computing the gradient over the full dataset can be computationally infeasible for the large datasets common in modern machine learning, since each update requires processing every single training example.
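As a concrete illustration, the sketch below applies batch gradient descent to a small least-squares problem. The synthetic data, the mean-squared-error loss, and the hyperparameter values are assumptions chosen for illustration, not part of the discussion above.

```python
import numpy as np

# Hypothetical least-squares setup (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

def full_batch_gradient(theta, X, y):
    """Gradient of the mean squared error J(theta) over the ENTIRE dataset."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

eta = 0.1                                     # learning rate (illustrative value)
theta = np.zeros(3)
for t in range(500):
    # theta_{t+1} = theta_t - eta * grad J(theta_t)
    theta = theta - eta * full_batch_gradient(theta, X, y)
```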
To address the computational burden of BGD, Stochastic Gradient Descent (SGD) computes the gradient and updates the parameters using only a single randomly chosen training example $(x^{(i)}, y^{(i)})$ at each step:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t; x^{(i)}, y^{(i)})$$

Alternatively, a small batch of $m$ examples (a mini-batch) is often used, which provides a compromise between the computational efficiency of single-sample SGD and the more accurate gradient estimate of BGD. This is commonly referred to as Mini-batch Gradient Descent, but often simply called SGD in practice.
$$\theta_{t+1} = \theta_t - \frac{\eta}{m} \sum_{i=1}^{m} \nabla_\theta J(\theta_t; x^{(i)}, y^{(i)})$$

The main characteristic of SGD (both single-sample and mini-batch) is the stochasticity, or noise, introduced into the gradient estimate. While this makes each update much cheaper to compute, the trajectory toward the minimum is often erratic, exhibiting high variance. This noise can sometimes help escape shallow local minima, but it also means SGD typically zigzags toward the optimum and may not settle precisely at the minimum without decaying the learning rate.
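A minimal sketch of the mini-batch update, reusing the same kind of hypothetical least-squares setup as above; the batch size, learning rate, and reshuffling scheme are illustrative assumptions rather than prescriptions.

```python
import numpy as np

# Hypothetical data, as in the batch example above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=1000)

def minibatch_gradient(theta, X_batch, y_batch):
    """Average gradient of the squared error over one mini-batch."""
    residual = X_batch @ theta - y_batch
    return X_batch.T @ residual / len(y_batch)

eta, m = 0.05, 32                      # learning rate and mini-batch size (assumed values)
theta = np.zeros(3)
for epoch in range(20):
    perm = rng.permutation(len(y))     # reshuffle the data each epoch
    for start in range(0, len(y), m):
        idx = perm[start:start + m]
        # theta_{t+1} = theta_t - (eta / m) * sum of per-example gradients
        theta = theta - eta * minibatch_gradient(theta, X[idx], y[idx])
```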
Comparison between Batch Gradient Descent and Stochastic Gradient Descent.
The high variance in SGD updates can cause oscillations, particularly in directions where the loss surface curves steeply, while progress along flatter directions remains slow. The Momentum method aims to dampen these oscillations and accelerate convergence by adding a fraction $\gamma$ (typically around 0.9) of the previous update vector $v_{t-1}$ to the current gradient step.
The update rules are:
$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_t)$$
$$\theta_{t+1} = \theta_t - v_t$$

Think of a ball rolling down a hill. The momentum term $v_t$ accumulates velocity in directions where the gradient consistently points downhill, allowing it to traverse flat regions faster and smoothing the update trajectory by averaging gradients over time. This often leads to faster convergence than standard SGD.
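The two update lines translate directly into code. The sketch below assumes a generic `grad(theta)` callable and illustrative hyperparameter values; the toy quadratic objective at the end is purely for demonstration.

```python
import numpy as np

def sgd_momentum(grad, theta0, eta=0.01, gamma=0.9, n_steps=1000):
    """v_t = gamma * v_{t-1} + eta * grad(theta_t);  theta_{t+1} = theta_t - v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                  # velocity starts at zero
    for _ in range(n_steps):
        v = gamma * v + eta * grad(theta)     # accumulate velocity
        theta = theta - v                     # step along the accumulated direction
    return theta

# Toy example: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta_min = sgd_momentum(lambda th: th, theta0=[5.0, -3.0])
```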
Nesterov Accelerated Gradient (NAG) is a refinement of the Momentum method. Instead of calculating the gradient at the current position $\theta_t$ and then adding the momentum term, NAG takes a "lookahead" step: it first applies the momentum update (approximating the next position) and then calculates the gradient at this approximate future position.
The update rules, often implemented slightly differently but conceptually capturing the lookahead idea, are:
$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_t - \gamma v_{t-1})$$
$$\theta_{t+1} = \theta_t - v_t$$

The term $\theta_t - \gamma v_{t-1}$ represents the approximate future position. By computing the gradient at this lookahead position, NAG anticipates where the parameters are heading and can correct the trajectory more effectively. This "smarter" momentum often allows for faster convergence and better performance than standard Momentum, especially on problems with complex loss surfaces.
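The only change relative to the Momentum sketch above is where the gradient is evaluated. Again, the `grad` callable, hyperparameters, and toy objective are assumptions for illustration.

```python
import numpy as np

def nesterov_momentum(grad, theta0, eta=0.01, gamma=0.9, n_steps=1000):
    """Gradient is evaluated at the lookahead point theta_t - gamma * v_{t-1}."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta - gamma * v          # approximate future position
        v = gamma * v + eta * grad(lookahead)  # gradient at the lookahead point
        theta = theta - v
    return theta

# Same toy quadratic as before: grad(theta) = theta.
theta_min = nesterov_momentum(lambda th: th, theta0=[5.0, -3.0])
```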
Illustration of convergence paths for SGD, Momentum, and NAG on a hypothetical loss surface. SGD exhibits more oscillation, while Momentum and NAG take more direct paths towards the optimum (red circle).
These first-order methods, particularly SGD with Momentum or NAG, form the basis for many optimizers used today. However, they rely on a single learning rate $\eta$ for all parameters and don't inherently adapt to the geometry of the loss surface. Understanding their behavior and limitations motivates the exploration of more sophisticated techniques like adaptive learning rates and second-order methods, which we will cover in subsequent chapters.