The standard Gradient Descent algorithm calculates the gradient of the loss function with respect to model parameters by utilizing the entire training dataset. Although this approach offers an accurate estimate of the gradient direction, which points towards the minimum loss, it comes with a substantial computational cost, particularly when dealing with vast datasets. For instance, with millions or billions of training examples, computing the gradient across all of them for a single parameter update can be prohibitively slow and memory-intensive.
To address this challenge, we turn to Stochastic Gradient Descent (SGD). The core idea behind SGD is remarkably simple: instead of calculating the gradient based on the whole dataset, we approximate it using just one randomly selected training example at each step.
The Mechanics of SGD
The process looks like this:
- Shuffle: At the beginning of each epoch (a full pass through the training data), shuffle the dataset randomly. This prevents any bias introduced by the order of examples.
- Iterate: Loop through the shuffled dataset, taking one example (x(i),y(i)) at a time.
- Compute Gradient: Calculate the gradient of the loss function J only for that single example: ∇J(w;x(i),y(i)).
- Update Parameters: Update the model parameters (weights w and biases b) using this single-example gradient:
w←w−η∇J(w;x(i),y(i))
where η is the learning rate.
This cycle repeats for all examples in the dataset, completing one epoch. The process continues for multiple epochs.
Why Use SGD? Advantages and Disadvantages
SGD offers several advantages:
- Speed: Parameter updates happen much more frequently (once per example instead of once per epoch). This often leads to faster convergence in terms of wall-clock time, especially on large datasets, even if it takes more updates overall.
- Large Datasets: It can handle massive datasets that don't fit into memory, as it only needs one example at a time.
- Escaping Local Minima: The updates in SGD are noisy because the gradient is calculated from a single example, which might not perfectly represent the overall gradient direction. This noise can actually be beneficial, potentially helping the optimization process jump out of shallow local minima or saddle points and find better solutions.
However, SGD also has disadvantages:
- Noisy Updates: The high variance in gradient estimates leads to a more erratic path towards the minimum compared to Batch Gradient Descent. The loss might fluctuate significantly between updates.
- Convergence: Due to the noise, SGD might not converge to the exact minimum but rather oscillate around it. Careful tuning of the learning rate, often involving a learning rate schedule (gradually decreasing η over time), is usually required.
Mini-Batch Gradient Descent: The Best of Both Worlds
In practice, neither pure Batch Gradient Descent nor pure SGD is typically used. Instead, a compromise called Mini-Batch Gradient Descent is the most common approach.
Mini-Batch GD works by calculating the gradient and updating parameters based on a small, randomly selected subset (a "mini-batch") of the training data, rather than the entire dataset or just a single example. Typical mini-batch sizes range from 32 to 256 examples, but can vary depending on the application and hardware memory constraints.
The update rule becomes:
w←w−η∇J(w;x(i:i+n),y(i:i+n))
where n is the mini-batch size, and ∇J(w;x(i:i+n),y(i:i+n)) is the average gradient over the mini-batch.
Mini-Batch GD offers several benefits:
- Smoother Convergence: The gradient estimate is less noisy than in SGD, leading to a more stable convergence path.
- Computational Efficiency: It uses hardware optimizations (like GPUs) by processing examples in batches using vectorized operations, making it computationally much faster than processing examples one by one as in SGD.
- Balance: It strikes a good balance between the robustness of SGD and the stability of Batch GD.
The following diagram illustrates the difference in the optimization paths taken by Batch GD, SGD, and Mini-Batch GD on a loss surface.
Comparison of Batch GD, SGD, and Mini-Batch GD characteristics. N is the total number of training examples, n is the mini-batch size.
Choosing between these variants depends on the dataset size and computational resources. For most deep learning applications today, Mini-Batch Gradient Descent is the default choice, providing a practical and efficient way to navigate the complex loss landscapes of neural networks. While we often refer to the optimization process simply as "SGD" in conversation or even in library implementations, it almost always implies the use of mini-batches.