Gradient descent confidently steps "downhill" based on local gradient information. But what happens if the terrain isn't a simple, smooth bowl leading straight to the lowest point? The optimization landscape of a machine learning cost function, especially a complex one, can be surprisingly bumpy. This introduces two potential pitfalls: local minima and saddle points.
Imagine you're hiking down a mountain range in thick fog, always choosing the steepest downward path available right where you stand. You might find yourself in a small valley. It feels like the bottom because all paths immediately around you lead upwards. However, you might be unaware that a much deeper valley, the true lowest point in the range (the global minimum), exists elsewhere.
This is the essence of a local minimum. It's a point θ∗ in the parameter space where the cost function J(θ∗) is lower than at all nearby points, but there might be another point θ∗∗ far away where J(θ∗∗) is even lower (the global minimum).
Why does gradient descent get stuck? At a local minimum (and also at a global minimum), the slope of the cost function is flat in all directions. Mathematically, the gradient is zero:

∇J(θ∗) = 0

Recall the gradient descent update rule:

θ := θ − α∇J(θ)

When ∇J(θ) = 0, the update term α∇J(θ) becomes zero, and the parameters θ stop changing. The algorithm converges, but possibly to a suboptimal solution represented by the parameters of the local minimum.
{"data": [{"x": [-3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3], "y": [4, 1.25, 0, 0.25, 1, 1.25, 0.5, 0.125, 0, 0.625, 2.25, 4.375, 7], "mode": "lines", "name": "Cost Function J(θ)", "line": {"color": "#228be6"}}, {"x": [-2, 1], "y": [0, 0], "mode": "markers", "name": "Minima", "marker": {"color": ["#f03e3e", "#37b24d"], "size": 10, "symbol": ["circle", "star"]}}], "layout": {"title": {"text":"Cost Function with Local and Global Minima", "x": 0.5, "xanchor": "center"}, "xaxis": {"title": "Parameter θ"}, "yaxis": {"title": "Cost J(θ)"}, "showlegend": True, "legend": {"traceorder": "normal", "items": [{"name": "Local Minimum", "visible": True, "color": "#f03e3e", "symbol": "circle"}, {"name": "Global Minimum", "visible": True, "color": "#37b24d", "symbol": "star"}]}}}
Example of a cost function J(θ) with a local minimum (red circle) and a global minimum (green star). Gradient descent starting in the left basin might converge to the local minimum, as the gradient there becomes zero.
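To make this concrete, the sketch below runs plain gradient descent on a small, hypothetical non-convex cost (a tilted double well, not the exact curve plotted above) from two different starting points. Which minimum we reach is determined entirely by the basin we start in; the function J, its gradient, the learning rate, and the starting points are all illustrative choices.

```python
import numpy as np

# A hypothetical non-convex cost with two basins: a shallow (local) minimum
# near theta = -1 and a deeper (global) minimum near theta = +1.
def J(theta):
    return (theta**2 - 1)**2 - 0.3 * theta

def grad_J(theta):
    return 4 * theta * (theta**2 - 1) - 0.3

def gradient_descent(theta0, alpha=0.01, steps=2000):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)   # theta := theta - alpha * grad J(theta)
    return theta

# The final parameter depends entirely on which basin we start in.
for start in (-2.0, 2.0):
    theta_final = gradient_descent(start)
    print(f"start={start:+.1f}  ->  theta={theta_final:+.4f}, J={J(theta_final):+.4f}")

# Expected behaviour (approximate):
#   start=-2.0  ->  the local minimum near theta ~ -0.96, with J ~ +0.29
#   start=+2.0  ->  the global minimum near theta ~ +1.04, with J ~ -0.31
```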
For simpler models like linear regression with a convex cost function (shaped like a single bowl), local minima aren't an issue; there's only one global minimum. However, for complex models like neural networks, the cost surface is high-dimensional and non-convex, potentially containing many local minima. Interestingly, research suggests that for very large neural networks, many local minima might yield performance nearly as good as the global minimum, making them less of a practical hindrance than initially feared. However, poor local minima can still exist and trap the optimization process, especially in smaller networks or specific problem types.
Another, often more troublesome, feature of high-dimensional optimization landscapes is the saddle point. Imagine the shape of a horse's saddle: it curves up along the horse's spine (rising toward the pommel and cantle) but curves down across the horse's back (dropping off to either side). At the very center of the saddle, the surface is flat.
Mathematically, a saddle point is a point θ∗ where the gradient is zero, ∇J(θ∗)=0, just like a minimum. However, it's not a minimum because, although the function increases in some directions (like moving toward the front or back of the saddle), it decreases in others (like sliding off to either side). Gradient descent gets stuck here for the same reason it gets stuck at a minimum: the zero gradient halts the updates.
{"data": [{"type": "surface", "x": [-2,-1.5,-1,-0.5,0,0.5,1,1.5,2]*9, "y": [-2]*9 + [-1.5]*9 + [-1]*9 + [-0.5]*9 + [0]*9 + [0.5]*9 + [1]*9 + [1.5]*9 + [2]*9, "z": [4,2.25,1,0.25,0,0.25,1,2.25,4, 0, -1.75, -3, -3.75, -4, -3.75, -3, -1.75, 0, -2, -3.75, -5, -5.75, -6, -5.75, -5, -3.75, -2, -2, -3.75, -5, -5.75, -6, -5.75, -5, -3.75, -2, 0, -1.75, -3, -3.75, -4, -3.75, -3, -1.75, 0, 2, 0.25, -1, -1.75, -2, -1.75, -1, 0.25, 2, 4, 2.25, 1, 0.25, 0, 0.25, 1, 2.25, 4, 6, 4.25, 3, 2.25, 2, 2.25, 3, 4.25, 6, 8, 6.25, 5, 4.25, 4, 4.25, 5, 6.25, 8], "colorscale": [[0, "#1c7ed6"],[0.5, "#dee2e6"],[1, "#f03e3e"]], "opacity":0.9, "contours": {"z": {"show": True, "usecolormap": True, "highlightcolor": "#4263eb", "project": {"z": True}}}}], "layout": {"title": {"text":"Saddle Point Example (J ≈ θ₁² - θ₂²)", "x": 0.5, "xanchor": "center"}, "scene": {"xaxis":{"title":"θ₁", "backgroundcolor": "#e9ecef"}, "yaxis":{"title":"θ₂", "backgroundcolor": "#e9ecef"}, "zaxis":{"title":"Cost J(θ₁, θ₂)", "backgroundcolor": "#e9ecef"}, "aspectratio": {"x": 1, "y": 1, "z": 0.7}}, "width": 650, "height": 550, "margin": {"l": 10, "r": 10, "b": 10, "t": 40}}}
A surface plot illustrating a saddle point near (0,0). The function decreases along one direction (blue indicates lower cost) but increases along another (red indicates higher cost). The gradient at the exact saddle point is zero, causing standard gradient descent to halt.
Why are saddle points often considered more problematic than local minima in high-dimensional spaces (like those encountered in deep learning)?
Two things work against us here. First, in very high-dimensional spaces, critical points are far more likely to be saddle points than local minima: being a minimum requires the cost to curve upward along every one of many directions simultaneously, which becomes increasingly improbable as the number of parameters grows. Second, standard gradient descent, relying only on the first derivative, has difficulty escaping saddle points because the gradient gives no clear indication of the downward-curving directions when it's near zero. The algorithm might oscillate back and forth or crawl very slowly along the plateau-like region surrounding the saddle point.
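A tiny numerical sketch makes this concrete. It runs plain gradient descent on a saddle-shaped cost of the same form as the plot above, J(θ₁, θ₂) = θ₁² − θ₂², starting just off the flat axis; the starting point, learning rate, and step counts are arbitrary choices for illustration.

```python
import numpy as np

# Saddle-shaped cost like the surface above: J(theta1, theta2) = theta1^2 - theta2^2.
def grad(theta):
    t1, t2 = theta
    return np.array([2 * t1, -2 * t2])  # gradient of t1^2 - t2^2

theta = np.array([1.0, 1e-6])  # start near the saddle, barely off the theta2 = 0 axis
alpha = 0.05                   # learning rate (illustrative choice)

for step in range(1, 101):
    theta = theta - alpha * grad(theta)       # theta := theta - alpha * grad J(theta)
    if step in (1, 25, 50, 75, 100):
        print(f"step {step:3d}: theta1 = {theta[0]:+.2e}, theta2 = {theta[1]:+.2e}")

# theta1 (the upward-curving direction) collapses toward 0 within a few dozen steps,
# while theta2 (the escape direction) grows by only a factor of 1.1 per step from a
# tiny starting value, so the iterate lingers near the saddle for many iterations.
```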
While standard gradient descent only uses first-order derivative information (the gradient ∇J), second-order derivatives (captured by the Hessian matrix H, as discussed in Chapter 3) provide information about the curvature of the cost function, which distinguishes these points: at a local minimum the Hessian has only positive eigenvalues (the cost curves upward in every direction), at a local maximum it has only negative eigenvalues, and at a saddle point it has a mix of positive and negative eigenvalues (the cost curves up in some directions and down in others).
Using the Hessian directly for optimization (like in Newton's method) is often computationally too expensive for the large number of parameters in typical machine learning models. However, this theoretical distinction helps explain why these different types of points with zero gradients behave differently and why saddle points can be particularly tricky for first-order methods like gradient descent.
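As a small illustration of that distinction, the snippet below classifies a critical point from the eigenvalues of its Hessian. The helper function and the two hand-written example Hessians (for the toy costs θ₁² + θ₂², a bowl, and θ₁² − θ₂², the saddle above) are illustrative, not part of any particular library.

```python
import numpy as np

# Classify a critical point (a point where the gradient is zero) from the
# eigenvalues of the Hessian there. The Hessians below are written out by hand
# for two toy costs:
#   bowl:   J = t1^2 + t2^2   ->  Hessian [[2, 0], [0,  2]]
#   saddle: J = t1^2 - t2^2   ->  Hessian [[2, 0], [0, -2]]
def classify_critical_point(hessian):
    eigvals = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
    if np.all(eigvals > 0):
        return "local minimum (cost curves up in every direction)"
    if np.all(eigvals < 0):
        return "local maximum (cost curves down in every direction)"
    return "saddle point (cost curves up in some directions, down in others)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))   # -> local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # -> saddle point
```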
While local minima and saddle points represent genuine challenges in optimization, they don't necessarily prevent us from training effective models. Variants of gradient descent, such as those incorporating momentum or adaptive learning rates (like RMSprop or Adam, common optimizers in deep learning frameworks), are specifically designed to navigate these difficult regions more effectively. They accumulate information about past gradients or adjust the learning rate for each parameter individually, which lets them build "speed" through flat regions, dampen oscillations, and escape saddle points more readily, potentially finding better minima. We won't cover those algorithms in detail here, but it's useful to know that solutions and improvements exist.
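To give a feel for why momentum helps, here is a minimal sketch comparing plain gradient descent with a simple "heavy ball" momentum update on the same saddle cost as before. The update shown (v := βv + ∇J(θ), then θ := θ − αv) is one common formulation; the values of β, α, the starting point, and the step count are illustrative, and production optimizers such as Adam add machinery not shown here.

```python
import numpy as np

# Saddle cost J(theta1, theta2) = theta1^2 - theta2^2, as in the earlier sketch.
def grad(theta):
    t1, t2 = theta
    return np.array([2 * t1, -2 * t2])

def run(beta, alpha=0.05, steps=100):
    theta = np.array([1.0, 1e-6])     # same starting point as the earlier sketch
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + grad(theta)    # accumulate a velocity from past gradients
        theta = theta - alpha * v     # step along the accumulated velocity
    return theta

plain = run(beta=0.0)      # beta = 0 recovers plain gradient descent
momentum = run(beta=0.9)   # accumulated velocity keeps pushing along the escape direction

print(f"plain GD, theta2 after 100 steps:  {plain[1]:.2e}")     # still tiny: lingering near the saddle
print(f"momentum, theta2 after 100 steps:  {momentum[1]:.2e}")  # far larger: well past the saddle
```

The toy cost here has no minimum at all (it keeps decreasing along θ₂), so the comparison only shows how quickly each method leaves the neighbourhood of the saddle, not where it ultimately ends up.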
Understanding these potential hurdles is an important part of effectively applying and troubleshooting gradient-based optimization in machine learning. Recognizing that your optimization process may be slowing to a crawl near a saddle point or may be stuck in a local minimum allows you to consider alternative optimization algorithms, adjust hyperparameters like the learning rate, or investigate other aspects of your model or training process.