In single-variable calculus, we found potential minima and maxima of a function f(x) by identifying critical points where the derivative f′(x)=0. At these points, the tangent line is horizontal, indicating that the function is momentarily flat. We then used the second derivative, f′′(x), to classify these points: f′′(x)>0 suggested a local minimum (concave up), while f′′(x)<0 suggested a local maximum (concave down).
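As a quick refresher, the short sketch below applies this one-variable test with SymPy. The cubic f(x) = x³ − 3x is just an illustrative choice, not a function from the text.

```python
# Illustrative 1D recap: find critical points of f(x) = x**3 - 3*x and
# classify them with the second derivative test.
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x

critical_points = sp.solve(sp.diff(f, x), x)   # f'(x) = 3*x**2 - 3 = 0 -> x = -1, 1
f_second = sp.diff(f, x, 2)                    # f''(x) = 6*x

for c in critical_points:
    print(c, f_second.subs(x, c))              # -1 -> -6 (local max), 1 -> 6 (local min)
```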
We can extend this idea to functions of multiple variables, like f(x,y) or more generally f(x) where x is a vector of variables (x1,x2,...,xn).
For a multivariable function, the analogue of the derivative being zero is the gradient vector being the zero vector. A critical point (or stationary point) x0 of a function f(x) is a point where the gradient is zero:
$$\nabla f(\mathbf{x}_0) = \mathbf{0}$$

Remember that the gradient ∇f is a vector containing all the partial derivatives:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)$$

So, the condition ∇f(x0)=0 means that all partial derivatives must be simultaneously zero at x0:
$$\frac{\partial f}{\partial x_1}(\mathbf{x}_0) = 0, \quad \frac{\partial f}{\partial x_2}(\mathbf{x}_0) = 0, \quad \dots, \quad \frac{\partial f}{\partial x_n}(\mathbf{x}_0) = 0$$

What does this mean intuitively? The gradient points in the direction of steepest ascent. If the gradient is the zero vector, there is no direction of ascent (or descent) from that point: the function is locally "flat" in every direction. Just as in the single-variable case, these critical points are candidates for local minima, local maxima, or something else.
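To make this condition concrete, here is a minimal sketch that solves the system ∇f = 0 symbolically with SymPy. The quadratic f(x, y) = x² + y² − 2x is an arbitrary example chosen only for illustration.

```python
# Illustrative sketch: locate critical points by solving grad f = 0.
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + y**2 - 2*x

# Gradient: the vector of all first-order partial derivatives
grad_f = [sp.diff(f, var) for var in (x, y)]   # [2*x - 2, 2*y]

# Critical points: every partial derivative must vanish simultaneously
critical_points = sp.solve(grad_f, (x, y), dict=True)
print(critical_points)                          # [{x: 1, y: 0}]
```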
In the single-variable case, a zero derivative alone wasn't enough; we needed the second derivative test. For multivariable functions, the Hessian matrix, H, which contains all the second-order partial derivatives, plays the role of the second derivative.
$$H = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$

Evaluating the Hessian matrix at a critical point x0 (where ∇f(x0)=0) tells us about the function's curvature around that point. The Second Derivative Test for multivariable functions uses the definiteness of H(x0):

- If H(x0) is positive definite (all eigenvalues positive), x0 is a local minimum.
- If H(x0) is negative definite (all eigenvalues negative), x0 is a local maximum.
- If H(x0) is indefinite (both positive and negative eigenvalues), x0 is a saddle point.
- Otherwise (some eigenvalues are zero), the test is inconclusive.
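As an illustration of this test, the sketch below checks the definiteness of a Hessian through its eigenvalues. The classify_critical_point helper is a hypothetical name used only for this example.

```python
# Illustrative sketch: classify a critical point from the Hessian evaluated there.
import numpy as np

def classify_critical_point(hessian):
    """Classify a critical point by the definiteness of its (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)          # eigenvalues of a symmetric matrix
    if np.all(eigvals > 0):
        return "local minimum (positive definite)"
    if np.all(eigvals < 0):
        return "local maximum (negative definite)"
    if np.any(eigvals > 0) and np.any(eigvals < 0):
        return "saddle point (indefinite)"
    return "inconclusive (some eigenvalues are zero)"

# Example: f(x, y) = x**2 + y**2 has Hessian [[2, 0], [0, 2]] everywhere
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))
# -> local minimum (positive definite)
```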
For most optimization problems in machine learning, we are primarily concerned with finding local (or ideally global) minima. Saddle points pose a challenge for some optimization algorithms: the gradient is zero there, which can stall progress even though the point is not a minimum.
At the origin (0,0) of the function f(x,y) = x² − y², the gradient ∇f = (2x, −2y) is (0,0). The Hessian is

$$H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}$$

which is indefinite (one positive and one negative eigenvalue). This indicates a saddle point: the function increases along the x-axis but decreases along the y-axis.
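The same conclusion can be checked numerically; the snippet below is only an illustrative verification of the eigenvalue signs.

```python
# Illustrative check: the Hessian of f(x, y) = x**2 - y**2 at the origin.
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.eigvalsh(H))   # [-2.  2.] -> mixed signs, so H is indefinite (saddle point)
```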
In machine learning, the function f we want to optimize is typically a cost function or loss function, which measures how poorly our model performs. The variables x are the model's parameters (weights and biases). Our goal is almost always to minimize this cost function.
Therefore, finding points where the gradient of the cost function with respect to the parameters is zero is a fundamental step. While the second derivative test using the Hessian provides a way to classify these points, calculating and analyzing the Hessian can be computationally expensive, especially for models with millions of parameters (like deep neural networks).
Furthermore, finding the points where ∇f=0 analytically by solving the system of partial derivative equations is usually intractable for complex ML models. This motivates the use of iterative optimization algorithms, like gradient descent (which we will explore in the next chapter), that use the gradient ∇f to navigate the cost surface and progressively move towards a minimum, without necessarily needing to compute the Hessian or explicitly solve ∇f=0. Understanding the concepts of gradients, critical points, and curvature, however, remains essential for understanding how and why these algorithms work, and what challenges (like saddle points) they might encounter.
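As a preview of that idea, here is a minimal gradient descent sketch. The cost f(x, y) = (x − 1)² + 2y², the starting point, the learning rate, and the iteration count are all arbitrary illustrative choices.

```python
# Minimal gradient descent sketch: follow the negative gradient toward a minimum,
# using only first-order information (no Hessian required).
import numpy as np

def grad_f(v):
    """Gradient of the illustrative cost f(x, y) = (x - 1)**2 + 2*y**2."""
    x, y = v
    return np.array([2.0 * (x - 1.0), 4.0 * y])

v = np.array([3.0, 2.0])                 # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    v = v - learning_rate * grad_f(v)    # step in the direction of steepest descent

print(v)                                 # approaches the minimizer (1, 0)
```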