First-order methods like gradient descent rely solely on the slope (gradient) of the loss function to determine the update direction. While effective, they can be slow to converge, especially in areas with complex curvature, such as narrow valleys or near plateaus. Second-order methods incorporate information about the function's curvature, aiming for more direct and potentially faster convergence towards a minimum.
Newton's method is the foundational second-order optimization algorithm. Its core idea is to iteratively approximate the objective function $f(x)$ around the current iterate $x_k$ using a simpler function, specifically a quadratic model, and then find the minimum of that model. This minimum serves as the next iterate $x_{k+1}$.
Recall the second-order Taylor expansion of a function $f: \mathbb{R}^n \to \mathbb{R}$ around a point $x_k$:
$$f(x_k + p) \approx f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T \nabla^2 f(x_k)\, p$$

Here, $p = x - x_k$ is the step we want to take from $x_k$. $\nabla f(x_k)$ is the gradient vector of $f$ at $x_k$, and $\nabla^2 f(x_k)$ is the Hessian matrix of $f$ at $x_k$, containing all second partial derivatives. Let's denote the Hessian at $x_k$ as $H_k = \nabla^2 f(x_k)$.
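To make the approximation concrete, here is a minimal sketch, assuming an illustrative two-variable function with hand-coded gradient and Hessian (the function and test point are arbitrary choices, not part of the method itself). It evaluates the quadratic model next to the true $f(x_k + p)$ and shows the gap shrinking roughly cubically as the step shrinks.

```python
import numpy as np

# Illustrative test function: f(x) = exp(x0 + x1) + x0^2 (an assumed example)
def f(x):
    return np.exp(x[0] + x[1]) + x[0]**2

def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2 * x[0], e])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e],
                     [e,       e]])

xk = np.array([0.5, -0.2])
direction = np.array([1.0, -1.5])   # arbitrary step direction for the check

# Quadratic (Taylor) model m_k(p) built at xk
def model(p):
    return f(xk) + grad(xk) @ p + 0.5 * p @ hess(xk) @ p

# The model error |f(xk + p) - m_k(p)| shrinks roughly like ||p||^3
for scale in [1.0, 0.1, 0.01]:
    p = scale * direction
    print(scale, abs(f(xk + p) - model(p)))
```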
Newton's method defines a quadratic model $m_k(p)$ of the function $f$ at $x_k + p$:
$$m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T H_k\, p$$

To find the best step $p$, we seek the value $p_k$ that minimizes this quadratic model $m_k(p)$. Assuming $m_k(p)$ has a unique minimum (which requires the Hessian $H_k$ to be positive definite), we can find it by setting the gradient of $m_k(p)$ with respect to $p$ equal to zero:
$$\nabla_p m_k(p) = \nabla f(x_k) + H_k p = 0$$

Solving this linear system for the step $p$, which we call the Newton step $p_k$, yields:
$$H_k p_k = -\nabla f(x_k)$$

Assuming the Hessian matrix $H_k$ is invertible, we get:
$$p_k = -H_k^{-1} \nabla f(x_k)$$

This $p_k$ represents the direction and magnitude of the step that leads to the minimum of the local quadratic approximation. The Newton's method update rule is then defined as:
$$x_{k+1} = x_k + p_k = x_k - H_k^{-1} \nabla f(x_k)$$

Compare this to the gradient descent update: $x_{k+1} = x_k - \alpha \nabla f(x_k)$, where $\alpha$ is the learning rate. Gradient descent always moves in the direction of the negative gradient, the direction of steepest local descent. Newton's method modifies this direction by multiplying the negative gradient with the inverse Hessian $H_k^{-1}$.
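As a sketch of how a single update might be computed in practice, the snippet below uses the Rosenbrock function with its analytic gradient and Hessian as a stand-in objective (an assumption for illustration; the starting point and learning rate are also arbitrary). It forms the Newton step by solving the linear system $H_k p_k = -\nabla f(x_k)$ rather than explicitly inverting $H_k$, which is generally the numerically preferable choice.

```python
import numpy as np

# Rosenbrock function, a standard non-quadratic test objective (illustrative choice)
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

def hess(x):
    return np.array([
        [2 - 400 * (x[1] - x[0]**2) + 800 * x[0]**2, -400 * x[0]],
        [-400 * x[0],                                 200.0],
    ])

xk = np.array([-1.2, 1.0])          # arbitrary starting iterate
g, H = grad(xk), hess(xk)

# Newton step: solve H_k p_k = -grad f(x_k) instead of forming H_k^{-1}
p_newton = np.linalg.solve(H, -g)
x_next = xk + p_newton

# A gradient descent step with a fixed, illustrative learning rate, for comparison
alpha = 1e-3
x_gd = xk - alpha * g

print("Newton step:", p_newton)
print("next iterate (Newton):          ", x_next, " f =", f(x_next))
print("next iterate (gradient descent):", x_gd, " f =", f(x_gd))
```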
The Hessian $H_k$ captures the curvature of the function $f$ at $x_k$. Multiplying by its inverse effectively rescales the gradient components based on this curvature. In directions where the curvature is high (the function increases sharply), the step size is reduced. In directions where the curvature is low (the function is flatter), the step size is increased. This leads the algorithm to take more direct steps towards the minimum, especially in elongated valleys where gradient descent tends to zigzag.
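A minimal sketch of this rescaling effect, assuming a deliberately elongated quadratic bowl $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 100)$ (an illustrative choice): the negative gradient points mostly along the steep axis, while the Newton direction points straight at the minimum at the origin.

```python
import numpy as np

# Elongated quadratic bowl: f(x) = 0.5 * x^T A x, minimum at the origin
A = np.diag([1.0, 100.0])
xk = np.array([10.0, 1.0])

g = A @ xk                          # gradient of the quadratic at xk
p_gd = -g                           # steepest-descent direction: dominated by the steep axis
p_newton = np.linalg.solve(A, -g)   # Newton direction: -A^{-1} g = -xk, aims at the minimum

print("negative gradient direction:", p_gd)      # [-10., -100.]
print("Newton direction:           ", p_newton)  # [-10.,   -1.]
```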
If the objective function $f(x)$ is itself a strictly convex quadratic function, Newton's method will find the exact minimum in a single step from any starting point, because the quadratic model $m_k(p)$ will be identical to $f(x_k + p)$.
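The following sketch checks this one-step property numerically on an assumed strictly convex quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ with a symmetric positive definite $A$ (the specific matrix, vector, and starting point are arbitrary choices): a single Newton step from any $x_0$ lands on the minimizer $x^* = A^{-1} b$.

```python
import numpy as np

# Strictly convex quadratic: f(x) = 0.5 * x^T A x - b^T x, A symmetric positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -4.0])

def grad(x):
    return A @ x - b                  # gradient of the quadratic; the Hessian is A everywhere

x0 = np.array([5.0, -7.0])            # arbitrary starting point
p0 = np.linalg.solve(A, -grad(x0))    # Newton step
x1 = x0 + p0

x_star = np.linalg.solve(A, b)        # true minimizer A^{-1} b
print(np.allclose(x1, x_star))        # True: one Newton step reaches the minimum
```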
Flow diagram illustrating the computation involved in a single step of Newton's method. Note the dependency on computing both the gradient and the Hessian, followed by a matrix inversion.
Under specific conditions, Newton's method exhibits very fast local quadratic convergence. This means that once the iterates $x_k$ are sufficiently close to the actual minimum $x^*$, the error decreases quadratically at each step. Formally, if $e_k = \|x_k - x^*\|$, then quadratic convergence implies $e_{k+1} \le C e_k^2$ for some constant $C$. This rapid convergence rate is a significant advantage over the typically linear convergence rate ($e_{k+1} \le c\, e_k$ for some $c < 1$) of first-order methods.
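As an illustration of the rate, the sketch below uses an assumed one-dimensional objective $f(x) = e^x - 2x$, whose minimizer is $x^* = \ln 2$, and prints the error after each Newton step; the number of correct digits roughly doubles per iteration once the iterate is close to $x^*$.

```python
import math

# Illustrative 1-D objective: f(x) = exp(x) - 2x, minimized at x* = ln(2)
def grad(x):
    return math.exp(x) - 2.0

def hess(x):
    return math.exp(x)

x_star = math.log(2.0)
x = 0.0                               # starting point reasonably close to x*
for k in range(5):
    x = x - grad(x) / hess(x)         # scalar Newton update
    print(k + 1, abs(x - x_star))     # error roughly squares each iteration
```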
However, this desirable convergence relies on several assumptions:
- The function $f$ is twice continuously differentiable, and its Hessian varies smoothly (for example, is Lipschitz continuous) near the minimum $x^*$.
- The Hessian is positive definite near $x^*$, so that the quadratic model has a unique minimum and the Newton step is well defined and points in a descent direction.
- The starting point is already sufficiently close to $x^*$; far from the minimum, the raw Newton step may fail to decrease $f$.
- The Hessian $H_k$ must be formed and the linear system $H_k p_k = -\nabla f(x_k)$ solved at every iteration.
These assumptions, particularly the computational burden of forming and inverting the Hessian ($O(n^3)$ for a dense $n \times n$ matrix) and the requirement for positive definiteness, motivate the development of modifications and alternatives, such as Quasi-Newton methods, which we will discuss next.