While gradient descent provides a fundamental mechanism for optimizing neural networks, the journey towards minimizing the loss function isn't always straightforward. The idealized image of smoothly rolling down a hill into the lowest valley can be misleading, especially in the high-dimensional spaces typical of deep learning models. Several challenges can impede or complicate the training process.
The loss surface of a neural network is rarely a simple convex bowl with a single minimum. It's often complex and non-convex, potentially containing numerous local minima. These are points where the loss is lower than at all nearby points, so the gradient is zero (∇L = 0), yet they are not the point of lowest possible loss (the global minimum).
Gradient descent, by its nature, follows the steepest path downwards. If it happens to land in a local minimum, the gradient becomes zero, and the algorithm stops updating the weights, effectively getting "stuck" even though a better solution might exist elsewhere on the loss surface.
A simple 1D loss surface illustrating points where the gradient is zero. Gradient descent may stop at either orange point (a local minimum) rather than reaching the lowest possible loss if the global minimum lies elsewhere.
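To make this concrete, here is a minimal NumPy sketch of plain gradient descent on a 1D non-convex function. The polynomial, learning rate, and starting point are chosen purely for illustration: starting on the right-hand side of the surface, the updates settle into the local minimum near x ≈ 1.9 and never reach the deeper minimum near x ≈ -2.1.

```python
import numpy as np

# A simple non-convex 1D "loss": global minimum near x ≈ -2.1,
# local minimum near x ≈ 1.9 (values chosen for illustration only).
def loss(x):
    return x**4 - 8 * x**2 + 3 * x

def grad(x):
    return 4 * x**3 - 16 * x + 3

x = 3.0        # starting point on the "wrong" side of the surface
eta = 0.01     # learning rate
for step in range(200):
    x -= eta * grad(x)

print(f"Converged to x = {x:.3f}, loss = {loss(x):.3f}")
# Prints roughly x = 1.90 (local minimum, loss ≈ -10.2),
# not the global minimum near x = -2.09 (loss ≈ -22.1).
```

Because the update only uses the local gradient, nothing in the rule itself can tell the algorithm that a deeper valley exists on the other side of the hill.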
While local minima were initially considered a major hurdle, research suggests that in the very high-dimensional spaces of deep networks, most critical points (where the gradient is zero) are actually saddle points, and the local minima that do exist tend to have loss values close to that of the global minimum.
More common than local minima in high dimensions are saddle points. A saddle point is a location where the gradient is zero, but it's a minimum along some dimensions and a maximum along others. Imagine the shape of a horse's saddle: it dips down in the direction from front to back but curves up in the direction from side to side.
Near a saddle point, the gradients can become very small, causing gradient descent to slow down significantly or even stall for a long time before potentially finding a direction to escape. This slowing down can make training inefficient.
A surface plot illustrating a saddle point at (0, 0, 0). The loss decreases along one axis but increases along the other. Gradients are zero at this point.
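The stalling behavior is easy to reproduce. The sketch below is a hypothetical example (not tied to the plot above): it runs gradient descent on f(x, y) = x^2 - y^2, starting almost exactly on the x-axis. The x-component shrinks quickly, the gradient norm collapses near the saddle, and many steps pass before the y-component grows enough to escape.

```python
import numpy as np

# Saddle-shaped surface: minimum along x, maximum along y, gradient zero at (0, 0).
def grad(w):
    x, y = w
    return np.array([2 * x, -2 * y])   # gradient of x**2 - y**2

w = np.array([1.0, 1e-6])   # start almost exactly on the x-axis
eta = 0.1
for step in range(100):
    g = grad(w)
    w -= eta * g
    if step % 20 == 0:
        print(f"step {step:3d}  w = ({w[0]:+.4f}, {w[1]:+.4e})  |grad| = {np.linalg.norm(g):.4e}")
# The x-coordinate shrinks quickly, but near (0, 0) the gradient norm becomes tiny,
# so progress stalls; only the slowly growing y-component eventually escapes the saddle.
```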
The loss landscape can also feature long, narrow valleys or ravines with steep sides but a gentle slope along the valley floor. In such scenarios, gradient descent might oscillate back and forth across the steep walls of the ravine while making only very slow progress along the bottom towards the actual minimum. This happens because the gradient points much more steeply across the ravine than along it. Adjusting the learning rate can be difficult here; a rate suitable for the gentle slope might be too large for the steep sides, causing instability.
Contour plot showing a loss function with a narrow valley minimum around (1, 1). The red line illustrates how standard gradient descent might oscillate across the valley, making slow progress towards the minimum.
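The sketch below reproduces this oscillation on a generic elongated quadratic (not the exact surface shown in the plot): the steep direction flips sign on every step, while the gentle direction shrinks only slightly per step.

```python
import numpy as np

# An elongated quadratic "ravine": very steep across y, very gentle along x.
def grad(w):
    x, y = w
    return np.array([x, 25.0 * y])   # gradient of 0.5 * (x**2 + 25 * y**2)

w = np.array([-5.0, 1.0])
eta = 0.07   # just small enough to stay stable in the steep direction (0.07 * 25 < 2)
for step in range(15):
    w -= eta * grad(w)
    print(f"step {step:2d}  x = {w[0]:+.3f}  y = {w[1]:+.3f}")
# y flips sign every step (oscillating across the ravine walls), while
# x moves towards 0 by only about 7% per step (slow progress along the floor).
```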
As discussed previously, the learning rate (η) is a critical hyperparameter, and choosing an appropriate value is essential. If it is too small, each update barely moves the weights and training becomes painfully slow; if it is too large, the updates overshoot the minimum, and the loss can oscillate or diverge entirely.
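A tiny experiment on the quadratic loss L(w) = w^2 (gradient 2w) illustrates this trade-off; the specific learning-rate values are only illustrative.

```python
# Effect of the learning rate on the simple quadratic loss L(w) = w**2 (grad = 2w).
def run(eta, steps=50, w0=5.0):
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w
    return w

for eta in (0.001, 0.4, 1.5):
    print(f"eta = {eta:<5}  final w after 50 steps = {run(eta):.4g}")
# eta = 0.001 -> w barely moves from 5.0 (too slow),
# eta = 0.4   -> w is essentially 0 (converged),
# eta = 1.5   -> w has blown up to ~5e15 (updates overshoot and diverge).
```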
Furthermore, gradient descent is sensitive to the scale of the input features. If features have vastly different ranges (e.g., one feature from 0-1, another from 1-10000), the loss surface can become elongated, exacerbating the ravine problem and making finding a single suitable learning rate difficult. This is why feature scaling (like normalization or standardization, covered in Chapter 5) is a standard preprocessing step.
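One way to see why scaling matters: for a linear model with squared loss, the curvature of the loss surface is governed by XᵀX, and features on very different scales make this matrix badly conditioned, which is exactly the elongated-ravine situation. The following sketch uses synthetic data to compare the condition number before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on wildly different scales.
X = np.column_stack([
    rng.uniform(0, 1, size=500),        # feature 1: range roughly 0-1
    rng.uniform(0, 10_000, size=500),   # feature 2: range roughly 0-10000
])

# Standardize: zero mean, unit variance per feature.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# A large condition number of X^T X corresponds to a long, narrow ravine
# in the loss surface of a linear model trained with squared loss.
for name, data in [("raw", X), ("standardized", X_scaled)]:
    cond = np.linalg.cond(data.T @ data)
    print(f"{name:>12} features: condition number = {cond:.1f}")
# The raw features give a hugely ill-conditioned surface; after standardization
# the condition number is close to 1, so a single learning rate works well in
# every direction.
```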
These challenges highlight that while vanilla gradient descent is the conceptual basis for optimization, more sophisticated algorithms are often needed in practice. Techniques like Momentum, RMSprop, and Adam (which we will explore in the next chapter) are designed specifically to mitigate these issues, enabling faster and more reliable convergence on complex loss surfaces.