Building upon the idea of approximating the Hessian matrix introduced in Quasi-Newton methods, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm stands out as one of the most effective and widely used techniques. Unlike Newton's method, which requires computing and inverting the true Hessian $\nabla^2 f(x)$ at each step, BFGS incrementally builds an approximation to the inverse Hessian, denoted $H_k$, using only first-order gradient information. This bypasses the significant computational burden associated with the true Hessian.
The Core Idea: Approximating the Inverse Hessian
The update direction in Newton's method is $p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. BFGS aims to replace the costly inverse Hessian term $[\nabla^2 f(x_k)]^{-1}$ with a matrix $H_k$ that is cheaper to compute and update. The search direction then becomes:
$$p_k = -H_k \nabla f(x_k)$$
The central challenge is how to update $H_k$ to $H_{k+1}$ at each iteration so that it effectively captures curvature information, similar to the true inverse Hessian.
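As a tiny concrete example, consider the sketch below (the quadratic objective, its coefficients, and the variable names are purely illustrative): the search direction is a single matrix-vector product, and with the common initialization $H_0 = I$ it coincides with the steepest-descent direction $-\nabla f(x_0)$.

```python
import numpy as np

# Illustrative quadratic objective f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x - b

x_k = np.zeros(2)
H_k = np.eye(2)            # initial inverse Hessian approximation H_0 = I

p_k = -H_k @ grad(x_k)     # p_k = -H_k ∇f(x_k); with H_k = I this is just -∇f(x_k)
print(p_k)                 # [ 1. -2.]
```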
The Secant Equation
Quasi-Newton methods, including BFGS, are typically constrained by the secant equation. This equation relates the change in the gradient $\nabla f$ to the change in the parameter vector $x$. Let $s_k = x_{k+1} - x_k$ be the step taken and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ be the change in gradients. A second-order Taylor expansion motivates the relationship:
$$\nabla f(x_{k+1}) - \nabla f(x_k) \approx \nabla^2 f(x_{k+1})(x_{k+1} - x_k)$$
Rearranging this gives $\nabla^2 f(x_{k+1}) s_k \approx y_k$. If we denote the Hessian approximation at step $k+1$ as $B_{k+1}$ (approximating $\nabla^2 f(x_{k+1})$), the secant equation requires $B_{k+1} s_k = y_k$. Correspondingly, for the inverse Hessian approximation $H_{k+1} = B_{k+1}^{-1}$, the secant equation becomes:
$$H_{k+1} y_k = s_k$$
This condition ensures that the updated approximation $H_{k+1}$ correctly relates the observed change in gradient $y_k$ to the step $s_k$ that was taken.
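For a quadratic objective the gradient is linear in $x$, so the secant relationship holds exactly rather than approximately, with the constant Hessian playing the role of $B_{k+1}$. The short check below (a sketch reusing the illustrative quadratic from the previous snippet) verifies this numerically.

```python
import numpy as np

# Same illustrative quadratic: f(x) = 0.5 * x^T A x - b^T x, gradient A x - b, Hessian A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda x: A @ x - b

x_k  = np.array([0.0, 0.0])
x_k1 = np.array([0.2, -0.4])
s_k = x_k1 - x_k              # step s_k
y_k = grad(x_k1) - grad(x_k)  # gradient change y_k

# For a quadratic, y_k = A s_k exactly, so B_{k+1} = A satisfies the secant equation.
print(np.allclose(A @ s_k, y_k))   # True
```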
The BFGS Update Formula
The BFGS algorithm provides a specific rank-two update rule for the inverse Hessian approximation $H_k$. Given the current approximation $H_k$, the step $s_k$, and the gradient difference $y_k$, the next approximation $H_{k+1}$ is calculated as:
$$H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}$$
Let's dissect this formula:
- Input: It uses the previous inverse Hessian approximation $H_k$, the step vector $s_k$, and the gradient difference vector $y_k$.
- Curvature Condition: The update requires $y_k^T s_k > 0$. This condition is related to the curvature of the function along the search direction. It ensures that the update maintains the positive definiteness of $H_k$ if the initial $H_0$ was positive definite and an appropriate line search (satisfying the Wolfe conditions) is used. Positive definiteness of $H_k$ guarantees that the search direction $p_k = -H_k \nabla f(x_k)$ is a descent direction.
- Rank-Two Update: The update modifies $H_k$ by adding a symmetric correction of rank at most two (expressible as a sum of two symmetric rank-one terms), hence the name.
- Properties: The resulting $H_{k+1}$ is symmetric and satisfies the secant equation $H_{k+1} y_k = s_k$.
While the formula looks complex, its structure is designed to incorporate the new curvature information contained in the pair $(s_k, y_k)$ while minimally perturbing the previous approximation $H_k$ in other directions.
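To see how compact the update is in code, here is a minimal NumPy sketch (the function name `bfgs_update` and the test values are illustrative choices, not a reference implementation). The final check confirms that the result is symmetric and satisfies the secant equation $H_{k+1} y_k = s_k$.

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse Hessian approximation H, given step s and gradient change y."""
    rho = 1.0 / (y @ s)                       # assumes the curvature condition y^T s > 0 holds
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)              # left factor (I - s y^T / y^T s); V.T is the right factor
    return V @ H @ V.T + rho * np.outer(s, s)

# Quick check: the updated matrix is symmetric and maps y_k back to s_k.
rng = np.random.default_rng(0)
H = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)          # constructed so that y^T s > 0
H_new = bfgs_update(H, s, y)
print(np.allclose(H_new, H_new.T), np.allclose(H_new @ y, s))   # True True
```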
The BFGS Algorithm Steps
The iterative process for BFGS optimization typically proceeds as follows (a compact code sketch of the complete loop follows below):
1. Initialization: Choose a starting point $x_0$. Initialize the inverse Hessian approximation $H_0$; a common choice is the identity matrix, $H_0 = I$. Set the iteration counter $k = 0$.
2. Compute Search Direction: Calculate the descent direction $p_k = -H_k \nabla f(x_k)$.
3. Line Search: Find an appropriate step size $\alpha_k > 0$ along the direction $p_k$. This is usually done with a line search algorithm that satisfies the Wolfe conditions, which helps ensure stability and maintain the positive definiteness of the Hessian approximation. A simpler backtracking (Armijo) line search is also common, although it enforces only sufficient decrease and does not by itself guarantee the curvature condition in step 7.
4. Update Position: Update the parameter vector: $x_{k+1} = x_k + \alpha_k p_k$.
5. Check Convergence: Evaluate the termination criteria (e.g., $\|\nabla f(x_{k+1})\| < \epsilon$, maximum iterations reached, small change in $x$ or in the function value). If converged, stop.
6. Compute Update Vectors: Define $s_k = x_{k+1} - x_k = \alpha_k p_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
7. Check Curvature Condition: If $y_k^T s_k > 0$:
   - Update Inverse Hessian: Compute $H_{k+1}$ using the BFGS update formula shown above.
   - Else (curvature condition fails): Skip the update (keep $H_{k+1} = H_k$) or use a safeguarded (damped) update. This situation is rare when a Wolfe-condition line search is used.
8. Increment and Loop: Set $k = k + 1$ and go back to step 2.
A diagram outlining the iterative steps involved in the BFGS optimization algorithm.
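Putting the steps above together, here is a compact, self-contained sketch of the loop (illustrative only: for brevity it uses a backtracking Armijo line search instead of a full Wolfe-condition search, so the curvature check may occasionally skip an update, and the tolerances are arbitrary choices).

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    """Minimize f from x0 with BFGS, using a simple backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                                   # Initialization: H_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:                 # Check Convergence: gradient norm below tolerance
            break
        p = -H @ g                                  # Compute Search Direction: p_k = -H_k ∇f(x_k)

        alpha, c1 = 1.0, 1e-4                       # Line Search: backtracking with sufficient decrease
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= 0.5

        x_new = x + alpha * p                       # Update Position
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g                 # Compute Update Vectors s_k, y_k

        if y @ s > 1e-10:                           # Check Curvature Condition: y_k^T s_k > 0
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)  # Update Inverse Hessian via the BFGS formula
        # else: skip the update and keep H unchanged

        x, g = x_new, g_new
    return x

# Example: minimize the Rosenbrock function; its unique minimum is at (1, 1).
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
rosen_grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                                 200 * (x[1] - x[0]**2)])
print(bfgs(rosen, rosen_grad, x0=[-1.2, 1.0]))      # should land close to [1., 1.]
```

For real workloads, a library routine such as SciPy's `scipy.optimize.minimize(fun, x0, jac=grad, method="BFGS")` is usually preferable, as it pairs the same update with a carefully implemented line search and stopping rules.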
Advantages and Disadvantages
Advantages:
- No Hessian Calculation/Inversion: Like other Quasi-Newton methods, BFGS avoids the direct computation, storage, and inversion of the $n \times n$ Hessian matrix. It only requires gradient computations.
- Fast Convergence: BFGS typically exhibits superlinear convergence rates (faster than linear, slower than quadratic) when close to a minimum, making it much faster than first-order methods like gradient descent for many problems.
- Robustness: It is generally considered one of the most robust and effective general-purpose optimization algorithms for smooth, unconstrained problems.
Disadvantages:
- Memory Cost: The primary drawback of BFGS is the need to store and update the dense $n \times n$ inverse Hessian approximation $H_k$. For high-dimensional problems typical in machine learning (where $n$, the number of parameters, can be millions or billions), storing this matrix becomes infeasible due to its $O(n^2)$ memory requirement.
- Computational Cost per Iteration: While cheaper than Newton's method, updating $H_k$ involves matrix-vector and outer-product operations, costing $O(n^2)$ computation per iteration. Computing the search direction $p_k = -H_k \nabla f(x_k)$ also takes $O(n^2)$ operations. This can still be prohibitive for very large $n$.
- Line Search Requirement: BFGS relies on a reasonably accurate line search (often satisfying the Wolfe conditions) to guarantee convergence and maintain the positive definiteness of $H_k$.
The significant memory and computational costs associated with the dense $H_k$ matrix in high dimensions directly motivate the development of Limited-memory BFGS (L-BFGS), which we will discuss in the next section. L-BFGS retains many benefits of BFGS while drastically reducing the memory footprint.