As introduced, second-order methods leverage curvature information to potentially achieve faster convergence than first-order methods. The core mathematical object capturing this curvature is the Hessian matrix.
Defining the Hessian Matrix
For a scalar-valued function f: ℝⁿ → ℝ representing our loss function with n parameters (denoted by the vector x), the Hessian matrix, typically denoted ∇²f(x) or simply H, is the n×n matrix of second-order partial derivatives, with entries
(H)ᵢⱼ = ∂²f / (∂xᵢ ∂xⱼ),   for i, j = 1, …, n.
This matrix generalizes the concept of the second derivative from single-variable calculus to multivariate functions. It essentially describes the "curvature", or the "slope of the slope", of the loss function in all possible directions within the parameter space.
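As a small worked illustration (the two-parameter function here is chosen purely for concreteness, not taken from any particular model), consider f(x₁, x₂) = x₁²x₂ + x₂³. Differentiating once gives the gradient ∇f(x) = (2x₁x₂, x₁² + 3x₂²), and differentiating each gradient component once more gives the 2×2 Hessian
∇²f(x) = [[2x₂, 2x₁], [2x₁, 6x₂]].
Note that the two off-diagonal entries coincide, anticipating the symmetry property discussed below.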
What the Hessian Tells Us: Curvature and Local Geometry
The Hessian matrix provides rich information about the local shape of the loss surface around a point x. Its properties, particularly its definiteness (determined by the signs of its eigenvalues), are directly linked to the local geometry; a small numerical check of these cases is sketched after the list below:
Positive Definite Hessian: All eigenvalues are positive (λᵢ > 0). This indicates that the function is locally convex around x, resembling an upward-opening bowl. A point x where ∇f(x) = 0 and ∇²f(x) is positive definite is a strict local minimum.
Negative Definite Hessian: All eigenvalues are negative (λᵢ < 0). This signifies local concavity, like an upside-down bowl. A point x where ∇f(x) = 0 and ∇²f(x) is negative definite is a strict local maximum.
Indefinite Hessian: Some eigenvalues are positive and some are negative. This occurs at saddle points, which curve upwards in some directions and downwards in others. Optimization algorithms can get stuck or slow down near saddle points.
Positive Semidefinite Hessian: All eigenvalues are non-negative (λᵢ ≥ 0), with at least one eigenvalue equal to zero. This can indicate a flat region or a valley floor, potentially corresponding to a non-strict minimum.
Negative Semidefinite Hessian: All eigenvalues are non-positive (λᵢ ≤ 0), with at least one eigenvalue equal to zero. This can indicate a flat region or a ridge crest, potentially corresponding to a non-strict maximum.
Figure: Relationship between Hessian definiteness (eigenvalues λ) and the local geometry of the loss surface at a stationary point (where the gradient is zero).
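These cases translate directly into a numerical test on the eigenvalues of the (symmetric) Hessian. Below is a minimal NumPy sketch; the helper name classify_stationary_point and the tolerance tol are illustrative choices, not from the text.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-8):
    """Classify a stationary point from the (symmetric) Hessian H via its eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix, ascending order
    if np.all(eigvals > tol):
        return "strict local minimum (positive definite)"
    if np.all(eigvals < -tol):
        return "strict local maximum (negative definite)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point (indefinite)"
    if np.all(eigvals > -tol):
        return "possible non-strict minimum (positive semidefinite)"
    return "possible non-strict maximum (negative semidefinite)"

# Example: an indefinite Hessian corresponds to a saddle point.
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
print(classify_stationary_point(H))          # saddle point (indefinite)
```

np.linalg.eigvalsh is used rather than a general eigensolver because the Hessian is symmetric, which guarantees real eigenvalues and reduces the work.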
In the context of Newton's method, the Hessian ∇²f(xₖ) is used to build a quadratic approximation of the function around the current iterate xₖ. The method then steps to the stationary point of this quadratic model, which is its minimum when the Hessian is positive definite. In that case the step uses the curvature information to point more directly towards the true minimum than the steepest-descent direction given by the negative gradient alone.
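Concretely, the Newton direction is obtained by solving the linear system ∇²f(xₖ) p = −∇f(xₖ) rather than forming the inverse explicitly. A minimal NumPy sketch, assuming a positive definite Hessian (the helper newton_step and the quadratic test problem are illustrative, not from the text):

```python
import numpy as np

def newton_step(grad, hess):
    """One undamped Newton step: solve H p = -g instead of forming H^{-1}."""
    return np.linalg.solve(hess, -grad)

# Illustrative quadratic loss f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite, so it is the Hessian of f
b = np.array([1.0, 1.0])
x = np.zeros(2)
grad = A @ x - b                    # gradient of f at the current iterate
x_next = x + newton_step(grad, A)
print(np.allclose(x_next, np.linalg.solve(A, b)))   # True: one step reaches the minimizer
```

Because a quadratic function is its own quadratic model, a single Newton step lands exactly on the minimizer; for general losses the step is only as accurate as the local approximation.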
Computing the Hessian
Calculating the Hessian matrix presents significant computational hurdles, especially in machine learning where the number of parameters n can be extremely large (millions or billions).
Analytical Calculation: If you have the exact mathematical expression for the loss function f(x), you can derive the formulas for all n² second partial derivatives. This is often feasible for simpler models but quickly becomes complex or intractable for deep neural networks.
Automatic Differentiation (AutoDiff): Modern deep learning frameworks rely on automatic differentiation. While it is primarily used for gradients (first derivatives), AutoDiff can also compute Hessian-vector products (Hv) efficiently without ever forming the full Hessian. Recovering the full Hessian typically means applying gradient computation recursively, once per basis vector, so the cost scales roughly as n times the cost of a gradient evaluation, leading to an overall cost on the order of O(n²) operations for the full matrix; a sketch of such a Hessian-vector product via double backward appears after this list.
Finite Differences: One can approximate second derivatives using finite differences, for example:
∂²f/∂xᵢ∂xⱼ ≈ [ f(x + h eᵢ + k eⱼ) − f(x + h eᵢ) − f(x + k eⱼ) + f(x) ] / (hk)
where eᵢ and eⱼ are standard basis vectors and h, k are small step sizes. This requires multiple function evaluations for each entry and suffers from numerical precision issues and high computational cost (O(n²) function evaluations). It is rarely used for training large models.
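To illustrate the Hessian-vector product route mentioned above, here is a minimal sketch using reverse-mode autodiff via a double backward pass, with PyTorch as one concrete framework. The helper name hessian_vector_product and the toy function are illustrative assumptions; only the standard torch.autograd.grad call is used.

```python
import torch

def hessian_vector_product(f, x, v):
    """Compute H v via double backward, without materializing the full Hessian."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(f(x), x, create_graph=True)[0]   # first backward: gradient, kept differentiable
    hvp = torch.autograd.grad(grad @ v, x)[0]                    # second backward: d(g . v)/dx = H v
    return hvp

# Example: f(x) = x1^2 * x2 + x2^3, whose Hessian at (1, 2) is [[4, 2], [2, 12]].
f = lambda x: x[0]**2 * x[1] + x[1]**3
x = torch.tensor([1.0, 2.0])
v = torch.tensor([1.0, 0.0])
print(hessian_vector_product(f, x, v))   # tensor([4., 2.]) = first column of the Hessian
```

Each such product costs only a small constant multiple of a gradient evaluation, which is what makes Hessian-free and conjugate-gradient-based Newton variants feasible at scale.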
Properties and Practical Challenges
The Hessian matrix has several important properties and associated challenges in practical optimization:
Symmetry: For most loss functions encountered in machine learning (assuming sufficient smoothness), the order of differentiation does not matter (Schwarz's theorem, also known as Clairaut's theorem), meaning ∂²f/∂xᵢ∂xⱼ = ∂²f/∂xⱼ∂xᵢ. Thus, the Hessian matrix is symmetric (H = Hᵀ). This property reduces the number of unique elements to compute and store from n² to n(n+1)/2.
Computational Cost: As mentioned, computing the full Hessian requires calculating O(n²) second derivatives. For a model with millions of parameters (n ≈ 10⁶), n² is 10¹², making direct computation infeasible.
Storage Cost: Storing the full n×n Hessian matrix requires O(n²) memory, which for large n exceeds the memory capacity of typical hardware. A float32 Hessian for n = 10⁶ parameters would need roughly 4 terabytes (10¹² entries at 4 bytes each).
Inversion Cost: Standard Newton's method requires computing the inverse of the Hessian, H⁻¹, or solving the linear system H p = −∇f. Direct inversion, or solving with standard methods such as Gaussian elimination, costs O(n³) operations, which is even more prohibitive than computing or storing the Hessian.
Non-Positive Definiteness: If the Hessian is not positive definite (i.e., it is indefinite or negative definite), the direction p = −H⁻¹∇f computed by Newton's method may not be a descent direction (it might point towards a maximum or saddle point), or the inverse may not even exist (if the Hessian is singular). This requires modifications to Newton's method, such as damping or trust regions (a damped-step sketch follows this list), or switching to methods that do not require explicit inversion or positive definiteness.
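As one example of such a modification (a sketch of a simple damping scheme, not the specific algorithm used by any particular library), the Hessian can be shifted by a multiple of the identity until it becomes positive definite, which guarantees a descent direction:

```python
import numpy as np

def damped_newton_direction(grad, hess, damping=1e-3, max_tries=20):
    """Hypothetical damping scheme: add lam * I to H until the Cholesky factorization
    succeeds (i.e., H + lam*I is positive definite), then solve for the step."""
    lam = damping
    I = np.eye(hess.shape[0])
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(hess + lam * I)    # raises LinAlgError if not positive definite
        except np.linalg.LinAlgError:
            lam *= 10.0                               # increase damping and retry
            continue
        # Solve (H + lam*I) p = -g using the Cholesky factor L (L L^T p = -g).
        y = np.linalg.solve(L, -grad)
        return np.linalg.solve(L.T, y)
    # With very heavy damping the step approaches a scaled negative gradient.
    return -grad / lam

# Example: an indefinite Hessian at a saddle point still yields a descent direction.
H = np.array([[2.0, 0.0], [0.0, -1.0]])
g = np.array([1.0, -1.0])
p = damped_newton_direction(g, H)
print(p, g @ p < 0)   # True: p is a descent direction
```

As the damping grows, the direction blends towards the plain negative gradient, so this scheme interpolates between a Newton step and gradient descent.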
These challenges, particularly the computational and storage costs (O(n²)) and the inversion cost (O(n³)), make pure Newton's method impractical for large-scale machine learning. This motivates the development of Quasi-Newton methods, discussed next, which approximate the Hessian or its inverse using only first-order gradient information, aiming to retain some benefits of second-order information without the prohibitive costs.