In the previous section, we explored the standard (batch) Gradient Descent algorithm. It computes the gradient of the loss function with respect to the model parameters using the entire training dataset. While this yields the exact gradient, and therefore a reliable descent direction, it comes with a significant computational cost on large datasets. Imagine having millions or billions of training examples; computing the gradient across all of them for just a single parameter update becomes prohibitively slow and memory-intensive.
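To make the cost concrete, here is a minimal NumPy sketch of full-batch gradient descent for linear regression with squared-error loss. The data, learning rate, and iteration count are illustrative choices, not values from the text; the key point is that every update touches all N examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 200 examples, 3 features, linear target plus noise
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)   # parameters
eta = 0.1         # learning rate (illustrative)

for step in range(200):
    # Full-dataset gradient of mean squared error:
    # every update requires one pass over ALL examples
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= eta * grad
```

With N in the millions, that single `X @ w` per update is exactly the bottleneck the next sections address.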
To address this challenge, we turn to Stochastic Gradient Descent (SGD). The core idea behind SGD is remarkably simple: instead of calculating the gradient based on the whole dataset, we approximate it using just one randomly selected training example at each step.
The process looks like this:

1. Shuffle the training dataset.
2. Pick a single training example (x⁽ⁱ⁾, y⁽ⁱ⁾).
3. Compute the gradient of the loss for that one example: ∇J(w; x⁽ⁱ⁾, y⁽ⁱ⁾).
4. Update the parameters: w ← w − η ∇J(w; x⁽ⁱ⁾, y⁽ⁱ⁾).

This cycle repeats for all examples in the dataset, completing one epoch. The process continues for multiple epochs.
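The per-example update cycle described above can be sketched in NumPy. The linear model, learning rate, and epoch count here are illustrative assumptions, chosen to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 100 examples, 3 features, linear target plus noise
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)   # parameters
eta = 0.01        # learning rate (illustrative)

for epoch in range(50):
    order = rng.permutation(len(X))   # shuffle each epoch
    for i in order:
        # Gradient of squared error for ONE randomly ordered example
        grad = 2 * (X[i] @ w - y[i]) * X[i]
        w -= eta * grad               # one cheap update per example
```

Note that each update uses a single row of `X`, so one epoch performs 100 parameter updates instead of one.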
SGD offers several advantages:

- Each update is very cheap to compute, since it uses only a single example.
- Parameters are updated far more frequently, often giving faster initial progress.
- The noise in the gradient estimates can help the optimizer escape shallow local minima and saddle points.
- It naturally supports online learning, where training examples arrive as a stream.
However, SGD also has disadvantages:

- The gradient estimates are noisy, so the loss fluctuates rather than decreasing smoothly.
- The optimizer tends to oscillate around the minimum instead of settling precisely, often requiring a decaying learning rate to converge.
- Processing one example at a time cannot exploit vectorized hardware such as GPUs efficiently.
In practice, neither pure Batch Gradient Descent nor pure SGD is typically used. Instead, a compromise called Mini-Batch Gradient Descent is the most common approach.
Mini-Batch GD works by calculating the gradient and updating parameters based on a small, randomly selected subset (a "mini-batch") of the training data, rather than the entire dataset or just a single example. Typical mini-batch sizes range from 32 to 256 examples, but can vary depending on the application and hardware memory constraints.
The update rule becomes:
w ← w − η ∇J(w; x⁽ⁱ⁾ⁱ⁺ⁿ⁾, y⁽ⁱ⁾ⁱ⁺ⁿ⁾)

where n is the mini-batch size, x⁽ⁱ⁾ⁱ⁺ⁿ⁾ and y⁽ⁱ⁾ⁱ⁺ⁿ⁾ denote the examples i through i+n, and ∇J is the average gradient over the mini-batch.
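The mini-batch update rule can be sketched as follows. As before, the linear-regression setup, batch size of 32, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 1000 examples, 3 features, linear target plus noise
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)   # parameters
eta = 0.05        # learning rate (illustrative)
n = 32            # mini-batch size

for epoch in range(20):
    perm = rng.permutation(len(X))        # shuffle each epoch
    for start in range(0, len(X), n):
        batch = perm[start:start + n]     # indices i : i+n
        Xb, yb = X[batch], y[batch]
        # AVERAGE gradient over the mini-batch (vectorized)
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= eta * grad
```

The vectorized `Xb @ w` over 32 rows is where mini-batches pay off on modern hardware: one matrix product replaces 32 separate per-example updates.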
Mini-Batch GD offers several benefits:

- Gradient estimates are less noisy than in single-example SGD, giving a more stable convergence path.
- Updates remain far cheaper and more frequent than in full-batch gradient descent.
- Mini-batches map well onto vectorized hardware such as GPUs, making each update highly efficient.
The following diagram illustrates the conceptual difference in the optimization paths taken by Batch GD, SGD, and Mini-Batch GD on a hypothetical loss surface.
Comparison of Batch GD, SGD, and Mini-Batch GD characteristics. N is the total number of training examples, n is the mini-batch size.
Choosing between these variants depends on the dataset size and computational resources. For most deep learning applications today, Mini-Batch Gradient Descent is the default choice, providing a practical and efficient way to navigate the complex loss landscapes of neural networks. While we often refer to the optimization process simply as "SGD" in conversation or even in library implementations, it almost always implies the use of mini-batches.
© 2025 ApX Machine Learning