As highlighted in the chapter introduction, Stochastic Gradient Descent (SGD) is a workhorse for training models on large datasets. Its strength lies in using small data subsets (mini-batches) for gradient estimation, drastically reducing the computational cost per iteration compared to calculating the gradient over the entire dataset (Batch Gradient Descent). However, this efficiency comes at a cost: variance.
Each stochastic gradient estimate, $g_{i_t}(\theta_t) = \nabla f_{i_t}(\theta_t)$ (where $f_{i_t}$ is the loss on the mini-batch $i_t$ sampled at iteration $t$), is an unbiased estimate of the true gradient $\nabla f(\theta_t) = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(\theta_t)$. This means that, in expectation, the stochastic gradient points in the right direction: $\mathbb{E}[g_{i_t}(\theta_t)] = \nabla f(\theta_t)$. But for any single iteration, $g_{i_t}(\theta_t)$ can deviate significantly from $\nabla f(\theta_t)$. This deviation introduces noise, or variance, into the parameter updates.
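To make this concrete, the short NumPy sketch below (an illustrative example on a synthetic least-squares problem; the data and function names are assumptions, not part of the algorithms discussed here) draws many mini-batch gradients at a fixed point $\theta$. Their average is close to the full gradient, confirming unbiasedness, while any individual estimate deviates from it noticeably.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: f(theta) = (1/N) * sum_i 0.5 * (x_i @ theta - y_i)^2
N, d = 10_000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.5 * rng.normal(size=N)

def full_gradient(theta):
    # Exact gradient over all N examples
    return X.T @ (X @ theta - y) / N

def minibatch_gradient(theta, batch_size=32):
    # Stochastic gradient over a random mini-batch
    idx = rng.choice(N, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / batch_size

theta = np.zeros(d)
g_true = full_gradient(theta)

# Draw many stochastic gradients at the same point theta
samples = np.array([minibatch_gradient(theta) for _ in range(2000)])

print("||mean(g) - grad f|| :", np.linalg.norm(samples.mean(axis=0) - g_true))   # small: unbiased
print("avg ||g - grad f||   :", np.linalg.norm(samples - g_true, axis=1).mean())  # larger: variance
```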
This inherent variance in SGD has several practical consequences:

- The optimization path is noisy: parameter updates oscillate rather than moving smoothly toward the minimum.
- Convergence requires care with step sizes: the learning rate typically must be decreased over time, which slows progress in the later stages of optimization.
- Convergence rates suffer: even for strongly convex problems, plain SGD achieves only sublinear convergence, whereas Batch GD converges linearly.
Figure: Comparison of optimization paths on a simple quadratic loss surface. Batch GD takes smooth steps, SGD exhibits noisy steps due to gradient variance, and variance-reduced methods aim for faster, less noisy convergence than SGD.
The core idea behind variance reduction techniques is to modify the stochastic gradient estimate to have lower variance while retaining computational efficiency. We want an estimate $\tilde{g}_t(\theta_t)$ such that:

- It remains (nearly) unbiased: $\mathbb{E}[\tilde{g}_t(\theta_t)] \approx \nabla f(\theta_t)$.
- Its variance is substantially smaller than that of the plain mini-batch gradient, ideally shrinking as $\theta_t$ approaches the optimum.
- Its per-iteration cost stays comparable to that of standard SGD.
Achieving this typically involves incorporating additional information into the gradient estimate. For instance, methods might periodically compute the full gradient or maintain moving averages of past gradients evaluated at different points. The goal is to construct a gradient estimate that corrects the noisy single-batch gradient towards the true gradient direction.
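As one illustration of this correction idea, the sketch below builds a control-variate gradient estimate in the spirit of such methods: a mini-batch gradient at the current parameters, minus the same mini-batch's gradient at a stored snapshot, plus the full gradient computed at that snapshot. This is a minimal, assumed setup (synthetic data, hypothetical function names), not the specific algorithms covered in the following sections.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic least-squares problem, as in the previous sketch
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=N)

def grad_batch(theta, idx):
    # Gradient of the average loss over the examples in idx
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

def full_grad(theta):
    # Exact gradient over all N examples
    return X.T @ (X @ theta - y) / N

def corrected_gradient(theta, snapshot, mu, batch_size=32):
    """Control-variate estimate: g_b(theta) - g_b(snapshot) + mu, where mu is the
    full gradient at the snapshot. The estimate is unbiased, and its variance
    shrinks as both theta and the snapshot approach the optimum."""
    idx = rng.choice(N, size=batch_size, replace=False)
    return grad_batch(theta, idx) - grad_batch(snapshot, idx) + mu

# Usage: refresh the snapshot (and its full gradient mu) occasionally, e.g. once per epoch.
theta = np.zeros(d)
snapshot = theta.copy()
mu = full_grad(snapshot)
for _ in range(100):
    theta -= 0.05 * corrected_gradient(theta, snapshot, mu)
```

The correction term costs one extra mini-batch gradient per step plus an occasional full-gradient pass, which is the kind of trade-off the algorithms in the next sections manage in different ways.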
By reducing the variance, these methods often achieve faster theoretical convergence rates (approaching the linear rates of Batch GD for certain problem classes) and demonstrate better practical performance compared to standard SGD, especially in the later stages of optimization.
The subsequent sections will examine specific algorithms like Stochastic Average Gradient (SAG) and Stochastic Variance Reduced Gradient (SVRG), which implement different strategies to achieve this variance reduction, making large-scale optimization more efficient and effective.