While Newton's method offers the appealing prospect of quadratic convergence near a minimum by utilizing curvature information, its direct application in large-scale machine learning runs into significant practical roadblocks. These challenges stem primarily from the computation, storage, and properties of the Hessian matrix, $\nabla^2 f(x)$.
The first major hurdle is the sheer computational expense involved.
Forming the Hessian: Calculating the Hessian matrix requires computing all second-order partial derivatives of the loss function $f$ with respect to the parameters $x$. If our model has $d$ parameters (which can easily be millions or billions in modern deep learning), the Hessian is a $d \times d$ matrix. Computing each of the $d^2$ elements can be computationally demanding. While techniques like automatic differentiation can help, the complexity often scales quadratically with the number of parameters, roughly $O(d^2)$ operations per iteration just to compute the Hessian.
Inverting the Hessian: Newton's update step requires solving the linear system $\nabla^2 f(x_k)\, p_k = -\nabla f(x_k)$ for the search direction $p_k$. This is equivalent to computing $p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. Standard methods for matrix inversion or for solving dense linear systems typically have a computational complexity of $O(d^3)$ operations.
For models with even a moderate number of parameters (e.g., tens of thousands), an $O(d^3)$ operation per iteration becomes computationally infeasible, far outweighing the cost of first-order methods, which typically have a per-iteration cost of $O(d)$ (dominated by the gradient calculation).
Figure: Computational flow of a Newton's method iteration, highlighting the significant costs associated with the Hessian matrix ($d$ is the number of parameters).
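To make these costs concrete, here is a minimal sketch of a single Newton iteration on a small toy problem. The loss function, the finite-difference `grad` and `hessian` helpers, and the parameter count $d = 50$ are illustrative assumptions; in practice the Hessian would be obtained via automatic differentiation, but the $O(d^2)$ formation cost and the $O(d^3)$ solve remain.

```python
import numpy as np

def grad(f, x, eps=1e-6):
    """Central-difference gradient: roughly 2*d evaluations of f."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def hessian(f, x, eps=1e-4):
    """Finite-difference Hessian: d columns, each a gradient difference,
    so the cost grows roughly quadratically in d."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        H[:, i] = (grad(f, x + e) - grad(f, x - e)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize to reduce numerical noise

def newton_step(f, x):
    g = grad(f, x)
    H = hessian(f, x)              # O(d^2) entries to form and store
    p = np.linalg.solve(H, -g)     # O(d^3) dense solve (never invert H explicitly)
    return x + p

# Toy loss with d = 50 parameters; A @ A.T + 2I keeps the Hessian positive definite.
d = 50
A = np.random.rand(d, d)
A = A @ A.T + 2.0 * np.eye(d)
f = lambda x: 0.5 * x @ A @ x + np.sum(np.cos(x))

x = np.random.randn(d)
x_next = newton_step(f, x)
```

Even at this tiny scale, forming the Hessian dominates the work of a single iteration; at $d$ in the millions, both the formation and the solve become entirely impractical.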
Beyond computation, storing the Hessian matrix itself presents a challenge. A $d \times d$ matrix requires $O(d^2)$ memory. If $d$ is in the millions, storing terabytes of data just for the Hessian is often impractical, especially if computation is distributed across multiple devices with limited memory. First-order methods, in contrast, only need to store the parameters and the gradient, typically requiring $O(d)$ memory.
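A quick back-of-the-envelope calculation makes this gap concrete. Assuming 64-bit floats and a hypothetical model with one million parameters (modest by modern standards):

```python
d = 1_000_000                # one million parameters
bytes_per_float = 8          # 64-bit floating point

gradient_bytes = d * bytes_per_float       # O(d) storage
hessian_bytes = d * d * bytes_per_float    # O(d^2) storage

print(f"gradient: {gradient_bytes / 1e6:.0f} MB")   # 8 MB
print(f"Hessian:  {hessian_bytes / 1e12:.0f} TB")   # 8 TB
```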
Newton's method is derived by minimizing a quadratic approximation of the function $f$ around the current point $x_k$. This derivation implicitly assumes that the quadratic approximation has a unique minimum, which mathematically corresponds to the Hessian matrix $\nabla^2 f(x_k)$ being positive definite (all of its eigenvalues are positive).
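For reference, here is the standard derivation of the Newton step from that quadratic model; the final step is only valid when $\nabla^2 f(x_k)$ is invertible, and the resulting point is a minimizer of the model only when the Hessian is positive definite.

```latex
% Quadratic model of f around the current iterate x_k:
m_k(p) = f(x_k) + \nabla f(x_k)^\top p + \tfrac{1}{2}\, p^\top \nabla^2 f(x_k)\, p

% Setting the gradient of the model with respect to p to zero:
\nabla_p m_k(p) = \nabla f(x_k) + \nabla^2 f(x_k)\, p = 0
\quad\Longrightarrow\quad
p_k = -\left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)

% The stationary point p_k minimizes m_k only if \nabla^2 f(x_k) \succ 0;
% if the Hessian has negative eigenvalues, m_k is unbounded below.
```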
In many machine learning problems, especially non-convex ones like training deep neural networks, this assumption often fails: the Hessian can be indefinite (with both positive and negative eigenvalues) or nearly singular, particularly near saddle points and in flat regions of the loss surface.
When the Hessian is not positive definite, the pure Newton step can lead to divergence or to convergence toward saddle points or local maxima instead of minima. Modifications like adding a multiple of the identity matrix to the Hessian (replacing $\nabla^2 f(x_k)$ with $\nabla^2 f(x_k) + \lambda I$, a Levenberg-Marquardt type adjustment) or using trust-region approaches are needed to handle these cases, adding complexity to the algorithm.
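A minimal sketch of such a Levenberg-Marquardt style safeguard is shown below: it adds $\lambda I$ to the Hessian and increases $\lambda$ until a Cholesky factorization succeeds (a practical test for positive definiteness). The function name, the initial $\lambda$, and the growth factor are illustrative choices, not a prescribed recipe.

```python
import numpy as np

def damped_newton_direction(H, g, lam=1e-4, growth=10.0, max_tries=20):
    """Solve (H + lam * I) p = -g, increasing lam until the shifted
    Hessian is positive definite (detected via Cholesky success)."""
    d = H.shape[0]
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(H + lam * np.eye(d))  # raises if not PD
            # Solve L L^T p = -g using the triangular factors.
            y = np.linalg.solve(L, -g)
            return np.linalg.solve(L.T, y)
        except np.linalg.LinAlgError:
            lam *= growth  # shift the eigenvalue spectrum further upward
    # If damping never produced a positive definite matrix, fall back to a
    # normalized steepest-descent direction.
    return -g / (np.linalg.norm(g) + 1e-12)
```

As $\lambda$ grows, the direction blends from the pure Newton step ($\lambda \to 0$) toward a short gradient-descent step (large $\lambda$), which is the same trade-off that trust-region methods manage in a more principled way.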
These significant computational, storage, and theoretical hurdles mean that pure Newton's method is rarely used directly for training large-scale machine learning models. However, the underlying idea of using curvature information is powerful. This motivates the development of Quasi-Newton methods, such as BFGS and L-BFGS, which approximate the Hessian or its inverse efficiently without explicitly forming, storing, and inverting the full matrix. We will explore these more practical alternatives in the following sections.