While the BFGS algorithm offers a significant improvement over Newton's method by avoiding direct Hessian computation and inversion, it still requires storing and updating an approximation of the Hessian (or its inverse), which is a dense $d \times d$ matrix, where $d$ is the number of parameters. For machine learning models with millions or even billions of parameters, storing a $d \times d$ matrix is completely infeasible due to memory constraints ($O(d^2)$ memory). This is where the Limited-memory BFGS (L-BFGS) algorithm comes into play.
The Core Idea: Trading Memory for Computation
L-BFGS cleverly sidesteps the $O(d^2)$ memory requirement of standard BFGS. Instead of storing the dense $d \times d$ Hessian approximation matrix $H_k$, L-BFGS only stores a small, fixed number, say $m$, of the most recent correction pairs $(s_i, y_i)$. Recall that:
- $s_k = x_{k+1} - x_k$ (change in parameters)
- $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ (change in gradients)
Typically, $m$ is a small integer, often between 5 and 20, regardless of the problem dimension $d$. This reduces the memory requirement drastically to $O(md)$, which is linear in the number of parameters $d$ and thus manageable even for very large models.
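To make the contrast concrete, the back-of-the-envelope comparison below (the 10-million-parameter model, $m = 10$, and 64-bit floats are illustrative assumptions, not figures from the text) shows why the dense matrix is out of reach while the correction pairs are not:

```python
# Rough storage comparison: dense BFGS matrix vs. L-BFGS correction pairs.
# d, m, and the float width are illustrative assumptions.
d = 10_000_000        # number of parameters
m = 10                # number of stored correction pairs
bytes_per_float = 8   # 64-bit floats

bfgs_bytes = d * d * bytes_per_float        # dense d x d matrix H_k
lbfgs_bytes = 2 * m * d * bytes_per_float   # m pairs (s_i, y_i), each a length-d vector

print(f"BFGS:   {bfgs_bytes / 1e12:,.0f} TB")  # ~800 TB  -- infeasible
print(f"L-BFGS: {lbfgs_bytes / 1e9:,.1f} GB")  # ~1.6 GB  -- manageable
```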
The trade-off is that instead of having direct access to the matrix $H_k$, L-BFGS computes the matrix-vector product $H_k \nabla f_k$ (which is needed to determine the search direction $p_k = -H_k \nabla f_k$) implicitly, using only the stored $(s_i, y_i)$ pairs. This computation involves a procedure often called the "L-BFGS two-loop recursion."
The L-BFGS Two-Loop Recursion
The heart of L-BFGS is how it reconstructs the search direction $p_k = -H_k \nabla f_k$ without ever forming the matrix $H_k$. It leverages the BFGS update formula, which can be expressed recursively. The algorithm computes the product $H_k \nabla f_k$ using the $m$ most recent pairs $\{(s_i, y_i)\}_{i=k-m}^{k-1}$.
The process involves:
- Initialization: Start with an initial Hessian approximation $H_k^0$. A common choice is $H_k^0 = \gamma_k I$, where $I$ is the identity matrix and $\gamma_k$ is a scaling factor chosen to approximate the magnitude of the true Hessian. A popular choice is:
$$\gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}$$
This scaling helps ensure the search direction has an appropriate magnitude.
- First Loop (Backward Pass): Initialize a working vector $q = \nabla f_k$, then iterate backward from $i = k-1$ down to $i = k-m$, computing an intermediate scalar $\alpha_i$ and updating $q$ at each step:
- $\rho_i = 1 / (y_i^T s_i)$
- $\alpha_i = \rho_i s_i^T q$
- $q = q - \alpha_i y_i$
- Scaling: Multiply the intermediate vector $q$ by the initial Hessian approximation $H_k^0$: $z = H_k^0 q$.
- Second Loop (Forward Pass): Iterate forward from $i = k-m$ up to $i = k-1$. This loop uses the stored $\alpha_i$ values and the $(s_i, y_i)$ pairs to reconstruct the final direction:
- $\beta_i = \rho_i y_i^T z$
- $z = z + s_i(\alpha_i - \beta_i)$
- Result: The final vector $z$ is the desired product, $z = H_k \nabla f_k$. The search direction is then $p_k = -z$.
This two-loop procedure requires only vector operations (dot products, scalar-vector multiplications, vector additions), resulting in a computational cost of approximately $O(md)$ per iteration, significantly cheaper than the $O(d^2)$ cost of standard BFGS updates or the $O(d^3)$ cost of Newton's method.
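The recursion above translates almost line for line into code. The following is a minimal NumPy sketch, not a library implementation: the function name, the choice to keep the pairs in plain Python lists ordered oldest-to-newest, and the fallback $\gamma_k = 1$ when no pairs are stored yet are all assumptions made for illustration.

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list):
    """Compute z = H_k * grad implicitly from the stored (s_i, y_i) pairs.

    s_list, y_list: the m most recent correction pairs, oldest first.
    Returns z; the search direction is then p_k = -z.
    """
    q = grad.copy()
    alphas = []
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]

    # First loop (backward pass): newest pair first.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    alphas.reverse()  # re-align with the oldest-first ordering

    # Scaling by the initial Hessian approximation H_k^0 = gamma_k * I.
    if s_list:
        s_last, y_last = s_list[-1], y_list[-1]
        gamma = np.dot(s_last, y_last) / np.dot(y_last, y_last)
    else:
        gamma = 1.0  # no curvature information yet: fall back to identity scaling
    z = gamma * q

    # Second loop (forward pass): oldest pair first.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, alphas):
        beta = rho * np.dot(y, z)
        z += (alpha - beta) * s

    return z  # z approximates H_k * grad
```

The search direction is then simply `p = -two_loop_recursion(grad, s_list, y_list)`; with no stored pairs this reduces to steepest descent.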
L-BFGS Algorithm Outline
A typical iteration of the L-BFGS algorithm looks like this:
- Choose starting point $x_0$.
- Set memory size $m$.
- For $k = 0, 1, 2, \dots$ until convergence:
a. Compute the gradient $\nabla f_k = \nabla f(x_k)$.
b. Compute the search direction $p_k = -H_k \nabla f_k$ using the two-loop recursion with the latest $m$ pairs $(s_i, y_i)$ and an initial Hessian approximation $H_k^0$.
c. Find a suitable step length $\alpha_k > 0$ using a line search procedure (e.g., a line search satisfying the Wolfe conditions).
d. Update the parameters: $x_{k+1} = x_k + \alpha_k p_k$.
e. Compute the new gradient $\nabla f_{k+1} = \nabla f(x_{k+1})$.
f. Calculate $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$.
g. Curvature Check: If $s_k^T y_k > \epsilon$ (for some small $\epsilon > 0$), store the pair $(s_k, y_k)$. If the number of stored pairs exceeds $m$, discard the oldest pair $(s_{k-m}, y_{k-m})$.
h. Check convergence criteria (e.g., $\|\nabla f_{k+1}\| < \text{tolerance}$).
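Putting the outline together, a bare-bones driver might look like the sketch below. It reuses the `two_loop_recursion` function from the previous section and, for brevity, substitutes a simple backtracking (Armijo sufficient-decrease) line search for a full Wolfe-condition line search in step (c); all function names, tolerances, and constants are illustrative assumptions, not a reference implementation.

```python
def lbfgs(f, grad_f, x0, m=10, max_iter=100, tol=1e-5, eps=1e-10):
    """Minimal L-BFGS loop following the outline above (sketch, not production code)."""
    x = x0.copy()
    g = grad_f(x)
    s_list, y_list = [], []

    for k in range(max_iter):
        p = -two_loop_recursion(g, s_list, y_list)     # step (b)

        # Step (c): simple backtracking line search on the Armijo condition
        # (a stand-in for a full Wolfe-condition line search).
        alpha, c1 = 1.0, 1e-4
        fx, gTp = f(x), np.dot(g, p)
        for _ in range(50):                            # bounded backtracking
            if f(x + alpha * p) <= fx + c1 * alpha * gTp:
                break
            alpha *= 0.5

        x_new = x + alpha * p                          # step (d)
        g_new = grad_f(x_new)                          # step (e)
        s, y = x_new - x, g_new - g                    # step (f)

        if np.dot(s, y) > eps:                         # step (g): curvature check
            s_list.append(s)
            y_list.append(y)
            if len(s_list) > m:                        # keep only the m most recent pairs
                s_list.pop(0)
                y_list.pop(0)

        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:                    # step (h): convergence check
            break

    return x
```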
Advantages of L-BFGS
- Memory Efficiency: Its primary advantage. $O(md)$ memory makes it suitable for problems with a vast number of parameters where storing $O(d^2)$ is impossible.
- Computational Efficiency: $O(md)$ computation per iteration is much faster than standard BFGS or Newton's method for large $d$.
- Good Convergence: Typically converges in far fewer iterations than first-order methods like Gradient Descent or Conjugate Gradient on many deterministic problems. Unlike full BFGS it does not in general retain superlinear convergence, but its practical rate is usually fast.
Disadvantages and Considerations
- Hyperparameter $m$: The size of the memory buffer, $m$, affects performance. A larger $m$ stores more curvature information, potentially leading to better search directions and faster convergence (fewer iterations), but increases the memory usage and computational cost per iteration. Typical values are small (e.g., 5-20).
- Line Search: Still requires a line search algorithm to find the step size $\alpha_k$, which can sometimes be computationally intensive itself, requiring multiple function and gradient evaluations per iteration.
- Curvature Condition: The update relies on the condition $s_k^T y_k > 0$, which implies that the function's curvature is positive in the direction of $s_k$. In non-convex regions, this might not hold. Implementations typically skip the update if this condition isn't met.
- Batch Method: L-BFGS, like standard BFGS and Newton's method, is fundamentally designed for deterministic (batch) optimization where the true gradient is computed using the entire dataset. Adapting it effectively to the stochastic settings common in deep learning (using mini-batches) is challenging, though some variants exist.
L-BFGS in Practice
L-BFGS is a workhorse algorithm in many scientific computing and optimization domains. In machine learning, it's often effective for:
- Training models where computing the full gradient is feasible (e.g., logistic regression, linear SVMs on moderately sized datasets).
- Problems where high precision is required near the optimum.
- Sometimes used as a final tuning step after initial training with stochastic methods like Adam or SGD.
While adaptive first-order methods like Adam are generally preferred for training large deep neural networks due to their compatibility with mini-batching and ability to navigate complex non-convex landscapes, understanding L-BFGS provides valuable insight into how curvature information can be leveraged for optimization, even under memory constraints. It represents a powerful bridge between first-order simplicity and second-order efficiency. Implementations are readily available in libraries like SciPy (`scipy.optimize.minimize(method='L-BFGS-B')`) and are used internally by some ML frameworks.
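As a quick illustration of the SciPy interface mentioned above, the snippet below minimizes the classic Rosenbrock function; the `maxcor` option controls the memory size $m$, and the specific starting point and tolerances are arbitrary choices for the example.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
result = minimize(
    rosen,                      # objective f(x)
    x0,
    jac=rosen_der,              # analytic gradient; omit to use finite differences
    method="L-BFGS-B",
    options={"maxcor": 10,      # memory size m (number of stored correction pairs)
             "gtol": 1e-8},     # gradient-norm tolerance
)
print(result.x)                 # ~[1, 1, 1, 1, 1], the Rosenbrock minimum
```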