While convex optimization problems offer the comforting guarantee that any local minimum is also a global minimum, the reality of most modern machine learning tasks, particularly deep learning, is decidedly non-convex. When the condition f(λx+(1−λ)y) ≤ λf(x)+(1−λ)f(y) does not hold for all x, y and λ ∈ [0,1], the optimization landscape becomes significantly more complex and treacherous to navigate. Understanding these challenges is fundamental before exploring the algorithms designed to mitigate them.
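As a quick, self-contained illustration of this definition, the sketch below samples random points and tests the inequality numerically for a convex function, x², and a non-convex one, sin(x). The function choices, sampling range, and tolerance are assumptions made for demonstration, not part of the text.

```python
import numpy as np

def violates_convexity(f, n_trials=10_000, seed=0):
    """Search for sample points where f(l*x + (1-l)*y) > l*f(x) + (1-l)*f(y)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, n_trials)
    y = rng.uniform(-5, 5, n_trials)
    lam = rng.uniform(0, 1, n_trials)
    lhs = f(lam * x + (1 - lam) * y)        # f evaluated at the interpolated point
    rhs = lam * f(x) + (1 - lam) * f(y)     # chord between f(x) and f(y)
    return bool(np.any(lhs > rhs + 1e-12))  # True if the inequality is ever violated

print(violates_convexity(lambda x: x**2))   # False: x^2 satisfies the condition everywhere
print(violates_convexity(np.sin))           # True: sin(x) is not convex
```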
The absence of convexity introduces several difficulties that standard gradient descent methods struggle with:
In a non-convex landscape, there can be many points where the gradient is zero, ∇f(x)=0, that are minima within some neighborhood (local minima) but are not the point with the lowest function value overall (the global minimum). A first-order method like gradient descent follows the direction of steepest descent and stops once it reaches any point where the gradient vanishes. If that point is a local minimum, the algorithm may converge to a suboptimal solution with no indication that a better solution exists elsewhere in the parameter space. The final performance of a model trained with gradient descent can therefore depend heavily on initialization: different starting points lead the optimizer into different basins of attraction, each corresponding to a different local minimum.
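The following minimal sketch illustrates this initialization dependence on an assumed one-dimensional example, f(x) = x⁴ − 3x² + x, which has two local minima; the learning rate and starting points are arbitrary choices for demonstration.

```python
def f(x):
    return x**4 - 3 * x**2 + x          # simple 1-D non-convex function with two local minima

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)             # plain gradient descent update
    return x

for x0 in (-2.0, 2.0):                  # two different initializations
    x_final = gradient_descent(x0)
    print(f"start={x0:+.1f}  ->  x={x_final:+.4f}, f(x)={f(x_final):+.4f}")

# Starting at -2.0 reaches the global minimum (x ~ -1.30, f ~ -3.51), while
# starting at +2.0 converges to the shallower local minimum (x ~ +1.13, f ~ -1.07).
```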
Perhaps even more pervasive than local minima in high-dimensional non-convex problems are saddle points. A saddle point is also a critical point where the gradient is zero, ∇f(x)=0, but it is not a local extremum. Instead, the function increases in some directions and decreases in others, resembling the shape of a horse's saddle.
A simplified illustration of a non-convex surface featuring a saddle point (red X) where the gradient is zero but it's not a minimum, and local minima (green dots). Simple gradient descent can slow down or get stuck near these points.
While second-order methods using the Hessian matrix can distinguish between minima, maxima, and saddle points by examining eigenvalues, first-order methods lack this information. Near a saddle point, the gradient becomes very small along all directions, causing methods like SGD to slow down drastically, sometimes appearing to have converged. Escaping saddle points efficiently is a significant challenge, especially since theoretical studies suggest they become exponentially more common than local minima as the dimensionality of the problem increases, which is typical for deep neural networks.
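To make the eigenvalue test concrete, the sketch below uses the assumed toy function f(x, y) = x² − y² (not from the original text), whose only critical point is the origin, and classifies that point from the signs of the Hessian eigenvalues.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at the origin: grad f = (2x, -2y) = 0 there.
def hessian(x, y):
    # Constant Hessian for this quadratic: d2f/dx2 = 2, d2f/dy2 = -2, mixed terms = 0.
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(hessian(0.0, 0.0))
if np.all(eigvals > 0):
    kind = "local minimum"          # positive curvature in every direction
elif np.all(eigvals < 0):
    kind = "local maximum"          # negative curvature in every direction
else:
    kind = "saddle point"           # mixed signs: increases in some directions, decreases in others
print(eigvals, "->", kind)          # [-2.  2.] -> saddle point
```

A first-order method sees only the gradient, which shrinks toward zero as it approaches this point from any direction, which is why it cannot tell a saddle from a minimum without curvature information.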
Plateaus are large, relatively flat regions of the loss surface where the gradient is consistently close to zero, but the function value itself might still be high. When an optimizer enters a plateau, progress becomes extremely slow because the small gradient leads to tiny updates, regardless of the learning rate. This can make training times prohibitively long. Plateaus are often associated with phenomena like vanishing gradients in deep networks.
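A small sketch of this effect, using an assumed toy scalar loss based on the sigmoid function: deep inside its flat upper tail the gradient is on the order of 10⁻⁶, so the updates are negligible even though the loss is still near its highest value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy loss with a wide plateau: for large positive x, sigmoid(x) is nearly flat at 1,
# so the gradient is close to zero even though the loss value is still high.
def loss(x):
    return sigmoid(x)

def grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x, lr = 12.0, 1.0                        # start deep inside the plateau
for step in range(5):
    g = grad(x)
    x -= lr * g
    print(f"step {step}: x={x:.6f}, loss={loss(x):.6f}, grad={g:.2e}")

# The gradient is ~6e-6, so even with lr = 1.0 the parameter barely moves and the
# loss stays near its plateau value of ~1 for a very long time.
```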
The loss surfaces of large machine learning models exist in extremely high-dimensional spaces (millions or billions of parameters). Our intuition, based on 2D or 3D visualizations, often fails us here. Properties that are rare in low dimensions (like saddle points with many negative curvature directions) can become dominant. The geometry can be incredibly complex, featuring mixtures of sharp valleys, flat plateaus, and numerous suboptimal critical points. Navigating this landscape effectively requires optimization algorithms that are less likely to get permanently stuck or significantly slowed down by these features.
Addressing these non-convex optimization challenges is a primary motivator for developing the advanced techniques discussed in subsequent chapters, such as adaptive learning rates (Chapter 3), second-order approximations (Chapter 2), and specialized methods for deep learning (Chapter 6). While finding a provable global minimum is often intractable for complex models, the goal becomes finding sufficiently good local minima or navigating saddle points effectively to achieve strong model performance in a reasonable amount of time.