Building upon first-order gradient methods, this chapter examines optimization techniques that utilize second-order derivative information. These methods consider the curvature of the loss function, often leading to faster convergence compared to gradient descent, particularly when close to a minimum.
We will begin with Newton's method, whose theoretical basis is a local quadratic approximation of the objective function:
$$f(x_k + p) \approx f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T \nabla^2 f(x_k)\, p$$

The core component here is the Hessian matrix $\nabla^2 f(x_k)$. We will discuss its properties and the computational challenges associated with its calculation and inversion, especially for high-dimensional machine learning problems.
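As a preview of the derivation in Section 2.1, the minimal sketch below takes a single Newton step on a hypothetical two-variable toy function with a hand-derived gradient and Hessian. The step solves the linear system $\nabla^2 f(x_k)\, p = -\nabla f(x_k)$ rather than forming the inverse explicitly, which is the usual practice.

```python
import numpy as np

# Hypothetical toy objective: f(x) = x0^4 + 2*x1^2
def f(x):
    return x[0]**4 + 2.0 * x[1]**2

def grad(x):
    # Hand-derived gradient of f
    return np.array([4.0 * x[0]**3, 4.0 * x[1]])

def hessian(x):
    # Hand-derived Hessian of f (diagonal for this toy function)
    return np.array([[12.0 * x[0]**2, 0.0],
                     [0.0,            4.0]])

x_k = np.array([1.5, -2.0])

# Newton step: solve H p = -g instead of computing H^{-1} explicitly
p = np.linalg.solve(hessian(x_k), -grad(x_k))
x_next = x_k + p

print("f(x_k)    =", f(x_k))     # 13.0625
print("f(x_next) =", f(x_next))  # 1.0
```

Note that the step minimizes the quadratic term in $x_1$ exactly in one iteration, while the quartic term in $x_0$ still requires further steps, illustrating why Newton's method converges fastest near a minimum where the quadratic model is accurate.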
To address these challenges, we will focus on Quasi-Newton methods, which approximate the Hessian or its inverse. You will learn about the popular BFGS algorithm and its memory-efficient variant, L-BFGS, which is widely used in practice. Additionally, we will cover trust-region methods, which offer an alternative strategy for managing step sizes using curvature information. The chapter concludes with a practical section on implementing L-BFGS; a library-based preview appears below.
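Before implementing L-BFGS from scratch in Section 2.8, it can help to see the method applied through an existing library. The following is a minimal sketch, assuming SciPy is available, that minimizes the Rosenbrock function with `scipy.optimize.minimize` and `method="L-BFGS-B"`; the test function, starting point, and dimensionality are illustrative choices, not part of this chapter's material.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    # Classic non-convex test function with a curved valley
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def rosenbrock_grad(x):
    # Analytic gradient; supplying it avoids finite-difference approximations
    grad = np.zeros_like(x)
    grad[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1]**2) - 2.0 * (1.0 - x[:-1])
    grad[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return grad

x0 = np.full(10, -1.2)  # illustrative starting point
result = minimize(rosenbrock, x0, jac=rosenbrock_grad, method="L-BFGS-B")

print("converged:", result.success)
print("minimum at:", result.x.round(4))  # expected near the all-ones vector
```

The key design point, developed in detail later in the chapter, is that L-BFGS never stores a full Hessian approximation; it keeps only a short history of gradient and iterate differences, which keeps memory usage linear in the problem dimension.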
2.1 Newton's Method: Theory and Derivation
2.2 The Hessian Matrix: Computation and Properties
2.3 Challenges with Newton's Method
2.4 Quasi-Newton Methods: Approximating the Hessian
2.5 BFGS Algorithm Explained
2.6 Limited-memory BFGS (L-BFGS)
2.7 Trust Region Methods
2.8 Hands-on Practical: Implementing L-BFGS