In the previous sections, we explored two extremes for computing the gradient: using the entire dataset (Batch Gradient Descent) and using just a single example (Stochastic Gradient Descent). Batch Gradient Descent provides an accurate estimate of the gradient but can be computationally expensive and slow for large datasets. Stochastic Gradient Descent updates parameters very frequently, leading to faster iterations, but the updates are noisy and the convergence path can oscillate significantly.
Mini-batch Gradient Descent offers a practical and widely used compromise between these two approaches. Instead of using all m training examples or just one, it computes the gradient and updates the parameters using a small, randomly selected subset of the training data called a "mini-batch".
The core idea is to process the data in small batches. Let the size of a mini-batch be denoted by b. Common mini-batch sizes are powers of 2, such as 32, 64, 128, or 256, often chosen because they map well onto memory and parallel hardware, though the optimal size depends on the specific problem and dataset.
The algorithm proceeds as follows:

1. Shuffle the training set at the start of each epoch.
2. Partition the shuffled data into mini-batches of size b (the last mini-batch may be smaller if m is not divisible by b).
3. For each mini-batch B, compute the gradient of the cost over that mini-batch only, ∇J_B(θ).
4. Update the parameters: θ := θ − α ∇J_B(θ), where α is the learning rate.
5. Repeat for a fixed number of epochs or until convergence.
Notice that the gradient ∇J_B(θ) is an estimate of the true gradient ∇J(θ) (the gradient calculated over the full dataset). It's less noisy than the SGD gradient (based on one example) but less accurate than the Batch GD gradient (based on all m examples).
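To make the loop concrete, here is a minimal NumPy sketch of the procedure applied to a linear regression model with a mean squared error cost. The function name, learning rate, and synthetic data are illustrative choices, not part of any particular library.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, b=32, epochs=100):
    """Minimal mini-batch gradient descent for linear regression (MSE cost)."""
    m, n = X.shape
    theta = np.zeros(n)                      # parameters to learn
    rng = np.random.default_rng(0)

    for _ in range(epochs):
        perm = rng.permutation(m)            # shuffle once per epoch
        for start in range(0, m, b):
            batch = perm[start:start + b]    # indices of the current mini-batch
            X_b, y_b = X[batch], y[batch]
            # Gradient of the MSE cost J_B(theta) over the mini-batch only
            grad = (2.0 / len(batch)) * X_b.T @ (X_b @ theta - y_b)
            theta -= lr * grad               # parameter update
    return theta

# Example usage on synthetic data
X = np.random.randn(1000, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(1000)
theta_hat = minibatch_gradient_descent(X, y, lr=0.05, b=64, epochs=50)
print(theta_hat)   # should approach [1.5, -2.0, 0.5]
```

Note how each parameter update touches only b rows of X, so the cost per update is independent of the total dataset size m.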
Mini-batch Gradient Descent strikes a beneficial balance:

- Efficiency: each update processes only b examples, so it is far cheaper than a full Batch GD pass, and vectorized operations over the mini-batch keep the hardware well utilized.
- Update frequency: with m/b updates per epoch, the parameters move much more often than with Batch GD, which speeds up progress on large datasets.
- Stability: averaging the gradient over b examples reduces the variance of the single-example SGD estimate, giving a smoother convergence path.
The mini-batch size b is a hyperparameter. The best value often depends on the dataset size, model complexity, and hardware characteristics, and is typically found through experimentation, as sketched below.
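As a rough illustration of that experimentation, the following sketch reuses the hypothetical minibatch_gradient_descent function and the synthetic X, y from the earlier example: it runs a fixed training budget for several candidate batch sizes and reports the resulting training loss and wall-clock time.

```python
import time
import numpy as np

def mse(X, y, theta):
    """Training loss used to compare the candidate batch sizes."""
    return np.mean((X @ theta - y) ** 2)

# Try several candidate mini-batch sizes with the same budget of epochs
for b in [16, 32, 64, 128, 256]:
    start = time.perf_counter()
    theta_hat = minibatch_gradient_descent(X, y, lr=0.05, b=b, epochs=20)
    elapsed = time.perf_counter() - start
    print(f"b={b:4d}  loss={mse(X, y, theta_hat):.4f}  time={elapsed:.2f}s")
```

In practice the comparison would be run on a held-out validation set and on the target hardware, since throughput differences between batch sizes are often what matters most.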
| Feature | Batch GD | Stochastic GD (SGD) | Mini-batch GD |
|---|---|---|---|
| Data per update | Entire dataset (m) | Single example (1) | Mini-batch (b) |
| Update Frequency | Low (once per epoch) | High (m per epoch) | Medium (m/b per epoch) |
| Gradient Quality | Exact (true gradient) | Noisy (high variance) | Estimate (reduced variance) |
| Computation Cost per Update | High | Low | Medium |
| Convergence | Smooth, potentially slow | Noisy, potentially fast | Relatively smooth and fast |
| Vectorization | Yes | Less effective | Yes (effective) |
| Memory Usage | Can be high | Low | Moderate |
Because it balances computational efficiency, update frequency, and convergence stability, Mini-batch Gradient Descent is the most commonly used optimization algorithm for training machine learning models, especially deep neural networks, on large datasets. It provides a practical way to navigate the cost function landscape effectively.