First-order methods like gradient descent rely solely on the slope (gradient) of the loss function to determine the update direction. While effective, they can be slow to converge, especially in areas with complex curvature, such as narrow valleys or near plateaus. Second-order methods incorporate information about the function's curvature, aiming for more direct and potentially faster convergence towards a minimum.
Newton's method is the foundational second-order optimization algorithm. Its core idea is to iteratively approximate the objective function $f(x)$ around the current iterate $x_k$ using a simpler function, specifically a quadratic model, and then find the minimum of that model. This minimum serves as the next iterate $x_{k+1}$.
Recall the second-order Taylor expansion of a function $f$ around a point $x_k$:

$$f(x_k + \Delta x) \approx f(x_k) + \nabla f(x_k)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x_k)\, \Delta x$$
Here, $\Delta x$ is the step we want to take from $x_k$. $\nabla f(x_k)$ is the gradient vector of $f$ at $x_k$, and $\nabla^2 f(x_k)$ is the Hessian matrix of $f$ at $x_k$, containing all second partial derivatives. Let's denote the Hessian at $x_k$ as $H_k = \nabla^2 f(x_k)$.
Newton's method defines a quadratic model $m_k(\Delta x)$ of the function at $x_k$:

$$m_k(\Delta x) = f(x_k) + \nabla f(x_k)^T \Delta x + \frac{1}{2} \Delta x^T H_k\, \Delta x$$
To find the best step $\Delta x$, we seek the value that minimizes this quadratic model $m_k(\Delta x)$. Assuming $m_k$ has a unique minimum (which requires the Hessian $H_k$ to be positive definite), we can find it by setting the gradient of $m_k$ with respect to $\Delta x$ equal to zero:

$$\nabla_{\Delta x}\, m_k(\Delta x) = \nabla f(x_k) + H_k\, \Delta x = 0$$
Solving this linear system for the step $\Delta x$, which we call the Newton step $\Delta x_k$, yields:

$$H_k\, \Delta x_k = -\nabla f(x_k)$$
Assuming the Hessian matrix $H_k$ is invertible, we get:

$$\Delta x_k = -H_k^{-1} \nabla f(x_k)$$
This represents the direction and magnitude of the step that leads to the minimum of the local quadratic approximation. The Newton's method update rule is then defined as:

$$x_{k+1} = x_k + \Delta x_k = x_k - H_k^{-1} \nabla f(x_k)$$
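As a minimal sketch of how this update looks in code, the step can be computed by solving the linear system $H_k \Delta x_k = -\nabla f(x_k)$ rather than forming the inverse explicitly. The `grad` and `hess` callables below are hypothetical helpers, assumed to return the gradient vector and Hessian matrix at a point.

```python
import numpy as np

def newton_step(x, grad, hess):
    """One Newton update: solve H_k dx = -grad f(x_k), then step.

    grad and hess are assumed user-supplied callables returning the
    gradient vector and Hessian matrix of f at x.
    """
    g = grad(x)
    H = hess(x)
    # Solving the linear system is cheaper and numerically safer than
    # explicitly computing H^{-1} and then multiplying.
    dx = np.linalg.solve(H, -g)
    return x + dx
```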
Compare this to the gradient descent update $x_{k+1} = x_k - \eta\, \nabla f(x_k)$, where $\eta$ is the learning rate. Gradient descent always moves in the direction of the negative gradient, the direction of steepest local descent. Newton's method modifies this direction by multiplying the negative gradient by the inverse Hessian $H_k^{-1}$.
The Hessian $H_k$ captures the curvature of the function at $x_k$. Multiplying by its inverse effectively rescales the gradient components based on this curvature. In directions where the curvature is high (the function increases sharply), the step size is reduced. In directions where the curvature is low (the function is flatter), the step size is increased. This leads the algorithm to take more direct steps towards the minimum, especially in elongated valleys where gradient descent tends to zigzag.
If the objective function $f$ is itself a strictly convex quadratic function, Newton's method will find the exact minimum in a single step from any starting point, because the quadratic model $m_k$ will be identical to $f$.
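The following sketch illustrates both points numerically on an arbitrarily chosen ill-conditioned quadratic; the matrix $A$, vector $b$, starting point, step size, and iteration count are illustrative assumptions rather than values from the text:

```python
import numpy as np

# Strictly convex quadratic: f(x) = 0.5 x^T A x - b^T x, with curvature
# 100x larger along the first axis than the second (a narrow valley).
A = np.array([[100.0, 0.0],
              [0.0,   1.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)          # exact minimizer

grad = lambda x: A @ x - b              # gradient of the quadratic
x0 = np.array([10.0, 10.0])

# Newton's method: the quadratic model equals f, so one step lands on x_star.
x_newton = x0 + np.linalg.solve(A, -grad(x0))
print(np.allclose(x_newton, x_star))    # True

# Gradient descent: the step size must respect the steep direction, which
# makes progress along the flat direction very slow.
x = x0.copy()
eta = 1.0 / 100.0
for _ in range(500):
    x = x - eta * grad(x)
print(np.linalg.norm(x - x_star))       # still about 0.06 after 500 steps
```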
Flow diagram illustrating the computation involved in a single step of Newton's method. Note the dependency on computing both the gradient and the Hessian, followed by a matrix inversion.
Under specific conditions, Newton's method exhibits very fast local quadratic convergence. This means that once the iterates $x_k$ are sufficiently close to the actual minimum $x^*$, the error decreases quadratically at each step. Formally, if $e_k = \|x_k - x^*\|$, then quadratic convergence implies $e_{k+1} \le C\, e_k^2$ for some constant $C > 0$. This rapid convergence rate is a significant advantage over the typically linear convergence rate ($e_{k+1} \le c\, e_k$ for some $0 < c < 1$) of first-order methods.
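A small one-dimensional sketch makes this rate visible. The test function $f(x) = e^x - 2x$ is an arbitrary choice for illustration, with minimizer $x^* = \ln 2$; running scalar Newton updates, the printed errors roughly square at each iteration, so the number of correct digits approximately doubles:

```python
import numpy as np

# f(x) = exp(x) - 2x has f'(x) = exp(x) - 2 and f''(x) = exp(x) > 0,
# so its unique minimizer is x* = ln 2.
x_star = np.log(2.0)
x = 2.0                       # starting point reasonably close to x*
for k in range(6):
    print(f"iter {k}: error = {abs(x - x_star):.3e}")
    g = np.exp(x) - 2.0       # first derivative
    h = np.exp(x)             # second derivative (always positive here)
    x = x - g / h             # scalar Newton update

# Once close to x*, the error shrinks roughly as e_{k+1} ~ C * e_k^2:
# about 1e-1, then 1e-2, then 4e-5, then below 1e-9.
```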
However, this desirable convergence relies on several assumptions:

- The function $f$ is twice continuously differentiable, with a Hessian that is Lipschitz continuous (it does not change too abruptly) near the minimum $x^*$.
- The Hessian is positive definite (and hence invertible) at and near $x^*$, so the Newton step is well defined and points in a descent direction.
- The starting point is already sufficiently close to $x^*$; far from the minimum, the quadratic model can be a poor approximation and the method may even diverge.
- The exact gradient and Hessian can be computed at every iterate.
These assumptions, together with the computational burden of forming and inverting the Hessian ($O(n^3)$ for dense matrices of size $n \times n$) and the requirement for positive definiteness, motivate the development of modifications and alternatives, such as Quasi-Newton methods, which we will discuss next.
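To make these concerns concrete, the sketch below shows one common practical safeguard (distinct from the Quasi-Newton approach of the next section): check positive definiteness with a Cholesky factorization, add damping when the check fails, and solve the linear system with the factor instead of forming an explicit inverse. The `grad` and `hess` callables and the `tau0` parameter are illustrative assumptions.

```python
import numpy as np

def damped_newton_step(x, grad, hess, tau0=1e-3):
    """One damped Newton step (a minimal sketch, not a full optimizer).

    grad and hess are assumed user-supplied callables returning the
    gradient vector and Hessian matrix of f at x.
    """
    g = grad(x)
    H = hess(x)
    n = H.shape[0]
    tau = 0.0
    while True:
        try:
            # Cholesky succeeds only for (numerically) positive definite matrices.
            L = np.linalg.cholesky(H + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, tau0)   # increase damping and retry
    # Solve (H + tau I) dx = -g by reusing the factor L L^T, rather than
    # computing an explicit matrix inverse.
    dx = np.linalg.solve(L.T, np.linalg.solve(L, -g))
    return x + dx
```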