Training deep neural networks often feels different from optimizing simpler machine learning models like linear regression or SVMs. Much of this difference stems from the nature of the function we're trying to minimize: the loss function, whose geometry over parameter space is called the loss landscape. Unlike the relatively well-behaved convex bowls we encounter in simpler settings, the loss landscape of a deep network is a complex, high-dimensional terrain full of unexpected features. Understanding these characteristics is fundamental to appreciating why certain optimization strategies work better than others in deep learning.
The most striking feature is the sheer dimensionality. Modern neural networks can have millions, even billions, of parameters (weights and biases). Each parameter represents a dimension in the optimization space. Visualizing a function in billions of dimensions is impossible, so our intuition, often built on 2D or 3D examples, can be misleading.
Mathematically, the loss function L(θ) for a deep network, where θ represents the vector of all parameters, is almost always non-convex. This means that unlike convex functions (which look like a single bowl), the loss landscape can have multiple local minima: points where the loss is lower than in their immediate vicinity, but not necessarily the lowest loss achievable across the entire landscape (the global minimum).
Why non-convexity? The non-linear activation functions (like ReLU, sigmoid, tanh) used in successive layers and the compositional structure of the network, output = f_L(…f_2(f_1(input; θ_1); θ_2)…; θ_L), combine to create a highly complex, non-linear relationship between the parameters θ and the final loss L. Even a simple operation like swapping two hidden units within a layer (together with their incoming and outgoing weights) produces the exact same network function but corresponds to a different point in the parameter space θ, immediately implying the existence of multiple equivalent minima and thus non-convexity.
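To make this permutation symmetry concrete, here is a minimal NumPy sketch (the layer sizes, activation, and random seed are arbitrary choices for illustration): swapping two hidden units, along with their bias entries and outgoing weights, lands at a different point in parameter space yet computes exactly the same output.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer network: 3 inputs -> 4 hidden units (tanh) -> 1 output.
W1 = rng.normal(size=(4, 3))   # hidden-layer weights
b1 = rng.normal(size=4)        # hidden-layer biases
W2 = rng.normal(size=(1, 4))   # output-layer weights

def forward(x, W1, b1, W2):
    return W2 @ np.tanh(W1 @ x + b1)

x = rng.normal(size=3)

# Swap hidden units 0 and 1: permute the rows of W1 and b1 and the
# corresponding columns of W2. This is a different point in parameter
# space, but it computes exactly the same function.
perm = [1, 0, 2, 3]
W1_p, b1_p, W2_p = W1[perm], b1[perm], W2[:, perm]

print(forward(x, W1, b1, W2))        # original parameters
print(forward(x, W1_p, b1_p, W2_p))  # permuted parameters -> identical output
```

Because both parameter settings yield the same loss on every input, any minimum comes with many symmetric copies, which by itself rules out convexity.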
While the existence of many local minima was initially thought to be the primary difficulty in non-convex optimization, research suggests that for high-dimensional problems typical of deep learning, saddle points might be a more significant impediment.
A saddle point is a location where the gradient ∇L(θ) is zero (or very close to zero), just like at a minimum or maximum, but it's not a local extremum. Instead, the function curves up in some directions and curves down in others, like a horse's saddle or a mountain pass. Mathematically, at a saddle point, the Hessian matrix ∇²L(θ) (the matrix of second partial derivatives) has both positive and negative eigenvalues.
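As a quick numerical illustration (using the standard textbook example f(x, y) = x² − y², not a function from the text above), the gradient vanishes at the origin while the Hessian has one positive and one negative eigenvalue, which is exactly the signature of a saddle:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin:
# the gradient vanishes there, but the Hessian is indefinite.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

def hessian(p):
    # Constant for this quadratic function.
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

origin = np.zeros(2)
print(grad(origin))                          # [0. 0.]   -> a critical point
print(np.linalg.eigvalsh(hessian(origin)))   # [-2.  2.] -> mixed signs: a saddle
```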
Why are saddle points problematic? Near a saddle point the gradient is close to zero in every direction, so a gradient-based optimizer takes tiny steps and can stall there for many iterations even though it has not reached a minimum.
Simple SGD can struggle to navigate saddle points efficiently. Methods incorporating momentum or adaptive learning rates (covered in Chapter 3) often handle saddle points better by accumulating velocity or adjusting step sizes based on gradient history, helping to "roll through" the flat regions associated with saddles. Second-order methods, which use Hessian information, can explicitly identify escape directions (negative curvature directions), but computing the Hessian is usually too expensive for large networks.
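The sketch below is a toy comparison rather than a faithful model of SGD on a real network: it runs plain gradient descent and a momentum variant near the saddle of the same f(x, y) = x² − y², with an arbitrarily chosen step size, momentum coefficient, and starting point. Momentum accumulates velocity along the negative-curvature direction and moves out of the saddle region noticeably faster.

```python
import numpy as np

def grad(p):                      # gradient of f(x, y) = x^2 - y^2
    x, y = p
    return np.array([2 * x, -2 * y])

def run(use_momentum, steps=30, lr=0.05, beta=0.9):
    p = np.array([1e-3, 1e-3])    # start very close to the saddle at (0, 0)
    v = np.zeros_like(p)
    for _ in range(steps):
        g = grad(p)
        if use_momentum:
            v = beta * v + g      # accumulate velocity along the escape (y) direction
            p = p - lr * v
        else:
            p = p - lr * g        # plain gradient descent
    return p

print(run(use_momentum=False))    # y grows slowly: the iterate lingers near the saddle
print(run(use_momentum=True))     # momentum carries the iterate much farther along y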
Figure: a simplified 2D projection illustrating potential features: local minima, a saddle point (flat gradient, but curvature that changes sign), and flatter plateau regions. Real landscapes exist in vastly higher dimensions.
Another common feature is the existence of large, relatively flat regions known as plateaus. In these areas, the loss changes very slowly across significant distances in parameter space, meaning the gradient ∇L(θ) is consistently small. Optimization can become extremely slow on plateaus, similar to the slowdown near saddle points. This phenomenon is sometimes linked to the vanishing gradient problem in very deep networks, where gradients struggle to propagate back through many layers. Techniques like normalization (discussed later in this chapter) and residual connections were developed partly to mitigate these flat regions and improve gradient flow.
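A simple way to see how plateau-like behavior connects to vanishing gradients is to chain a saturating activation many times and track the derivative of the composition; the depth of 10 and the starting input below are arbitrary illustrative choices. Each sigmoid contributes a factor of at most 0.25, so the gradient shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compose the sigmoid with itself `depth` times and track the derivative of
# the composition via the chain rule. Each factor sigma'(z) = s * (1 - s)
# is at most 0.25, so the product shrinks geometrically -- a 1D analogue of
# a plateau where the gradient stays consistently tiny.
z = 2.0
grad = 1.0
for depth in range(1, 11):
    s = sigmoid(z)
    grad *= s * (1.0 - s)   # derivative of the sigmoid at the current activation
    z = s                   # feed the activation into the next "layer"
    print(f"depth {depth:2d}: gradient magnitude = {grad:.2e}")
```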
Not all minima are created equal in terms of generalization. Imagine two minima, A and B, that achieve roughly the same low loss value on the training data, but where the loss rises steeply in every direction around A (a sharp minimum) while rising only gently around B (a flat minimum).
There's growing evidence suggesting that optimizers converging to flatter minima often lead to models that generalize better to unseen data. The intuition is that if the loss landscape is flat around the solution, small variations in the input data (leading to slightly different optimal parameter settings) are less likely to cause large increases in the loss or significant changes in the model's predictions.
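One common way to probe this idea is to perturb the parameters slightly and measure how much the loss rises; the two toy one-dimensional losses, the perturbation radius, and the sample count below are invented purely for illustration. A sharp basin shows a large average increase under the same perturbations, while a flat basin barely changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy losses, both minimized at theta = 0: a sharp basin and a flat one.
sharp_loss = lambda theta: 50.0 * theta**2
flat_loss = lambda theta: 0.5 * theta**2

def sharpness(loss, theta_star, radius=0.1, n_samples=100):
    """Average loss increase under small random parameter perturbations."""
    increases = []
    for _ in range(n_samples):
        eps = rng.uniform(-radius, radius)
        increases.append(loss(theta_star + eps) - loss(theta_star))
    return np.mean(increases)

print("sharp minimum:", sharpness(sharp_loss, 0.0))  # large average increase
print("flat minimum: ", sharpness(flat_loss, 0.0))   # small average increase
```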
Interestingly, the choice of optimization algorithm and its parameters (like batch size and learning rate) can influence whether the optimizer tends to find sharp or flat minima. For example, using larger batch sizes in SGD often correlates with convergence to sharper minima, while smaller batch sizes (introducing more noise) might explore more and find flatter minima. Adaptive methods like Adam are also studied in this context, though the relationships are complex and subject to ongoing research.
Understanding these diverse characteristics (high dimensionality, non-convexity, the prevalence of saddle points over problematic local minima, plateaus, and the distinction between sharp and flat minima) is essential. They explain why optimizing deep networks is challenging and why specific techniques, from adaptive learning rates to normalization layers and careful initialization, have become standard practice. These features directly influence the behavior, convergence speed, and generalization performance of the optimization algorithms we employ.