While standard Gradient Descent, often referred to as Batch Gradient Descent (BGD), provides a solid theoretical foundation for optimization, it runs into significant practical hurdles when applied to the large datasets and complex models common in deep learning. Calculating the gradient requires summing contributions from every single data point in the training set before making a single parameter update. Let's examine why this approach becomes problematic.
The most immediate challenge is the sheer computational expense. Deep learning models often train on datasets containing millions or even billions of examples. Consider a dataset with N examples. To perform one update step using BGD, you need to:

1. Run a forward pass over all N examples to compute the loss.
2. Run a backward pass to accumulate the gradient contribution from each of those N examples.
3. Average the contributions and apply a single parameter update.
This entire process must be repeated for every update step, and for BGD one update corresponds to one full pass over the data (an epoch). The cost of computing the gradient therefore scales linearly with the dataset size N. When N is very large, each step becomes extremely slow, making training times prohibitively long.
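As a minimal sketch, here is one BGD update for linear regression on synthetic data (the names, sizes, and learning rate are all illustrative):

```python
import numpy as np

# One Batch Gradient Descent step for linear regression on synthetic data.
rng = np.random.default_rng(0)
N, d = 1_000_000, 10                # N examples, d features
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

w = np.zeros(d)                     # model parameters
lr = 0.1                            # learning rate

# A single update must touch every one of the N examples:
preds = X @ w                       # forward pass over the full dataset
grad = X.T @ (preds - y) / N        # gradient averaged over all N examples
w -= lr * grad                      # one parameter update

# Doubling N doubles the work for this single step: the cost is O(N) per update.
```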
Furthermore, depending on the implementation and model complexity, calculating gradients for the entire dataset might require loading a significant amount of data into memory, which can exceed the capacity of available hardware (like GPUs).
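A back-of-the-envelope estimate shows how quickly this bites; the dataset size and feature count below are hypothetical:

```python
# Hypothetical dataset: 10 million examples, 1,000 float32 features each.
N, d, bytes_per_float32 = 10_000_000, 1_000, 4
gigabytes = N * d * bytes_per_float32 / 1e9
print(f"{gigabytes:.0f} GB")  # 40 GB for the raw inputs alone, more than most single GPUs hold
```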
Large datasets often contain redundant information. Many data points might be similar and produce gradients pointing in roughly the same direction. BGD processes all these similar examples in every single step, even though a smaller sample might provide a good enough estimate of the true gradient direction. This exhaustive computation over redundant data makes each update step less efficient than it could be. Imagine learning to recognize cats by looking at 1 million nearly identical pictures of the same cat before adjusting your understanding; you could likely learn faster by looking at a smaller, more diverse set of cat pictures.
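A quick experiment supports this intuition: if a dataset is built by duplicating a small set of distinct examples, the gradient computed on a tiny random sample points almost exactly where the full-batch gradient does. The setup below is synthetic and purely illustrative:

```python
import numpy as np

# Build a highly redundant dataset: 100 distinct rows, each repeated 10,000 times.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 10))
X = np.repeat(base, 10_000, axis=0)              # N = 1,000,000 examples
y = X @ rng.normal(size=10) + 0.01 * rng.normal(size=len(X))
w = np.zeros(10)

def grad(Xb, yb, w):
    return Xb.T @ (Xb @ w - yb) / len(Xb)        # mean-squared-error gradient

full = grad(X, y, w)                             # uses all 1,000,000 examples
idx = rng.choice(len(X), size=1_000, replace=False)
sample = grad(X[idx], y[idx], w)                 # uses just 0.1% of the data

cos = full @ sample / (np.linalg.norm(full) * np.linalg.norm(sample))
print(f"cosine similarity: {cos:.4f}")           # very close to 1.0
```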
The loss landscapes of deep neural networks are highly complex and non-convex, featuring numerous local minima (points that are optimal within a neighborhood but not globally) and saddle points (points where the gradient is zero, but which are not minima).
Batch Gradient Descent calculates the exact gradient based on the entire dataset. While this provides a 'true' direction of steepest descent for the current parameters over the whole dataset, this smooth, averaged gradient might lack the 'noise' needed to escape certain undesirable regions of the loss surface.
*Figure: a hypothetical loss surface showing a global minimum (low point), a local minimum (a higher valley), and a saddle point (flat locally, but curving down in one direction and up in another). BGD might get stuck in the local minimum or slow down at the saddle point.*
While no optimization algorithm can guarantee finding the global minimum, the deterministic nature of BGD makes it potentially more susceptible to becoming permanently trapped than methods that inject randomness into the gradient estimate, as the sketch below illustrates.
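To make the saddle-point concern concrete, consider f(x, y) = x^2 - y^2, which has a saddle at the origin. In this toy setup (not any library's optimizer), exact gradient descent started on the x-axis converges to the saddle, while a small amount of gradient noise escapes it:

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle at (0, 0): curving up along x, down along y.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])   # exact gradient of f

rng = np.random.default_rng(0)
lr = 0.1

exact = np.array([1.0, 0.0])   # starts on the x-axis, where the y-gradient is zero
noisy = np.array([1.0, 0.0])   # same start, but each gradient is slightly perturbed
for _ in range(50):
    exact -= lr * grad(exact)                                 # deterministic step
    noisy -= lr * (grad(noisy) + 0.01 * rng.normal(size=2))   # noisy step

print("exact:", exact)   # ~[0, 0]: converged to the saddle and stays there
print("noisy:", noisy)   # y is far from 0: noise broke the symmetry and escaped
```

The deterministic run can never leave the x-axis because its y-gradient is exactly zero there; any perturbation, however small, is amplified along the descending y direction.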
Batch Gradient Descent requires the entire dataset to be available before any learning can occur, which makes it unsuitable for online learning, where data arrives sequentially. In such settings the model must adapt as new data comes in, without storing and reprocessing the full history for every update.
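The shape of an online update looks roughly like the following sketch, where the hypothetical stream() generator stands in for any sequential data source; each example is processed once, used for an immediate update, and discarded:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=5)

def stream():
    """Stand-in for a sequential data source (sensor feed, click log, ...)."""
    while True:
        x = rng.normal(size=5)
        yield x, x @ true_w + 0.1 * rng.normal()

w = np.zeros(5)
lr = 0.01
for t, (x, y) in enumerate(stream()):
    g = (x @ w - y) * x    # squared-error gradient from this single example
    w -= lr * g            # update immediately; the example is never stored
    if t == 10_000:
        break

print(np.round(w - true_w, 2))   # near zero: the model adapted from the stream
```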
These limitations, particularly the computational cost and the potential for getting stuck, highlight the need for more practical optimization algorithms in deep learning. This motivates the move towards methods that use smaller subsets of data for each update, leading us to Stochastic Gradient Descent and Mini-batch Gradient Descent.