Now that we've explored the theoretical underpinnings of Newton's method, BFGS, and the motivation behind L-BFGS, let's put theory into practice. This section provides hands-on experience using L-BFGS to solve an optimization problem, focusing on leveraging existing library implementations, which is the common approach in machine learning workflows. We'll compare its performance to simpler methods and understand its practical usage.
Recall that L-BFGS aims to capture the benefits of second-order information (curvature) like BFGS, but avoids the storage and computation burden of maintaining a full $n \times n$ Hessian approximation. It achieves this by storing only a limited history (typically 5 to 20) of the most recent step vectors $s_k = x_{k+1} - x_k$ and gradient difference vectors $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. These vectors implicitly define the Hessian approximation used in the two-loop recursion to calculate the search direction $p_k = -H_k \nabla f(x_k)$, where $H_k$ is the inverse Hessian approximation.
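To make the two-loop recursion concrete, here is a minimal NumPy sketch (illustrative only; we rely on SciPy's implementation below). It assumes `s_list` and `y_list` hold the stored pairs ordered from oldest to newest and that each pair satisfies the curvature condition $s_k^\top y_k > 0$:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: compute p = -H_k @ grad from stored (s, y) pairs."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []

    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y

    # Initial Hessian guess H_k^0 = gamma * I (a standard scaling choice)
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q

    # Second loop: oldest pair to newest
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s

    return -r  # Search direction p_k
```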
For this practical exercise, let's use a standard machine learning problem: Binary Logistic Regression. The goal is to find the parameters $\theta$ that minimize the negative log-likelihood (or cross-entropy loss) function. Given a dataset $(X, y)$ where $X \in \mathbb{R}^{m \times n}$ ($m$ samples, $n$ features) and $y \in \{0, 1\}^m$, the objective function $J(\theta)$ is:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right]$$

where $h_\theta(x) = \sigma(\theta^T x)$ and $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function.
We also need the gradient $\nabla J(\theta)$:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

We'll use a synthetic dataset for simplicity, allowing us to focus on the optimizer's behavior.
```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt  # Used for basic plotting; Plotly for the final comparison

# Generate synthetic data
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=2,
                           n_classes=2, random_state=42)

# Add intercept term
X = np.hstack((np.ones((X.shape[0], 1)), X))
n_features = X.shape[1]

# Split data (optional, good practice but not strictly needed for optimizer demo)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Using full dataset for simplicity here
X_train, y_train = X, y
m = X_train.shape[0]

# --- Objective Function and Gradient ---
def sigmoid(z):
    # Clip z to avoid overflow/underflow in exp
    z_clipped = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z_clipped))

def cost_function(theta, X, y):
    h = sigmoid(X @ theta)
    # Add small epsilon for numerical stability in log
    epsilon = 1e-7
    cost = -np.mean(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))
    return cost

def gradient(theta, X, y):
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / len(y)
    return grad

# Initial parameters
initial_theta = np.zeros(n_features)

print(f"Dataset shape: {X_train.shape}")
print(f"Initial cost: {cost_function(initial_theta, X_train, y_train):.4f}")
```
Most scientific computing libraries provide robust implementations of L-BFGS. We'll use `scipy.optimize.minimize`, a versatile function that interfaces with various optimization algorithms. The specific variant used here is L-BFGS-B, which also allows box constraints (lower and upper bounds on parameters), though we won't use constraints here.
To use `scipy.optimize.minimize` with L-BFGS, we need to provide:

- `fun`: The objective function to minimize (our `cost_function`).
- `x0`: The initial guess for the parameters (our `initial_theta`).
- `args`: A tuple of additional arguments needed by the objective and gradient functions (our `X_train`, `y_train`).
- `method`: Specify `'L-BFGS-B'`.
- `jac`: The function that computes the gradient vector (our `gradient`). Providing the analytical gradient is much more efficient than relying on numerical approximations.
- `options`: A dictionary for algorithm-specific parameters. Important ones for L-BFGS-B include:
  - `maxcor`: The number of recent $(s, y)$ pairs to store (the history size $m$). Default is 10.
  - `gtol`: Gradient norm tolerance for convergence. The algorithm stops when $\max_i |\nabla J(\theta)_i| \le$ `gtol`.
  - `maxiter`: Maximum number of iterations.

```python
# --- Run L-BFGS ---
print("\nRunning L-BFGS...")

# Store history for plotting
lbfgs_cost_history = []

def callback_fn(theta):
    """Callback function to store cost at each iteration."""
    cost = cost_function(theta, X_train, y_train)
    lbfgs_cost_history.append(cost)
    # Optional: print progress
    # print(f"Iter {len(lbfgs_cost_history)}: Cost = {cost:.6f}")

lbfgs_result = minimize(fun=cost_function,
                        x0=initial_theta,
                        args=(X_train, y_train),
                        method='L-BFGS-B',
                        jac=gradient,
                        callback=callback_fn,
                        options={'maxcor': 15,    # History size m
                                 'gtol': 1e-6,    # Gradient tolerance
                                 'disp': True,    # Display convergence messages
                                 'maxiter': 200})

# --- Results ---
print("\nL-BFGS Results:")
if lbfgs_result.success:
    print(f"Optimization successful: {lbfgs_result.message}")
    print(f"Final cost: {lbfgs_result.fun:.6f}")
    print(f"Number of iterations: {lbfgs_result.nit}")
    print(f"Number of function evaluations: {lbfgs_result.nfev}")
    print(f"Number of gradient evaluations: {lbfgs_result.njev}")
    # optimal_theta_lbfgs = lbfgs_result.x
else:
    print(f"Optimization failed: {lbfgs_result.message}")

# Prepend initial cost for plotting
lbfgs_cost_history.insert(0, cost_function(initial_theta, X_train, y_train))
```
Running this code will output convergence information, the final cost, and the number of iterations and function/gradient evaluations. Notice that L-BFGS typically converges in relatively few iterations compared to first-order methods for well-behaved problems like logistic regression.
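As a quick sanity check on the solution (an optional addition, assuming the run above succeeded so that `lbfgs_result.x` holds the fitted parameters), we can look at training accuracy:

```python
# Optional sanity check: training accuracy of the fitted parameters.
optimal_theta_lbfgs = lbfgs_result.x
preds = (sigmoid(X_train @ optimal_theta_lbfgs) >= 0.5).astype(int)
accuracy = np.mean(preds == y_train)
print(f"Training accuracy with L-BFGS solution: {accuracy:.3f}")
```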
To appreciate the efficiency of L-BFGS, let's implement a simple Gradient Descent (GD) algorithm and run it on the same problem. We'll need to choose a learning rate α and run it for a fixed number of iterations or until convergence.
```python
# --- Simple Gradient Descent Implementation ---
def gradient_descent(X, y, initial_theta, learning_rate, n_iterations):
    theta = initial_theta.copy()
    cost_history = []
    for i in range(n_iterations):
        cost = cost_function(theta, X, y)
        cost_history.append(cost)
        grad = gradient(theta, X, y)
        theta = theta - learning_rate * grad
        if i % 100 == 0:  # Print progress occasionally
            print(f"GD Iter {i}: Cost = {cost:.6f}")
    print(f"GD Final Cost after {n_iterations} iterations: {cost_history[-1]:.6f}")
    return theta, cost_history

print("\nRunning Gradient Descent...")
learning_rate = 0.1
n_iterations_gd = 500  # Typically needs far more iterations than L-BFGS

# Run GD
gd_theta, gd_cost_history = gradient_descent(X_train, y_train, initial_theta,
                                             learning_rate, n_iterations_gd)
```
Now, let's visualize the cost decrease per iteration for both L-BFGS and Gradient Descent using Plotly. This clearly demonstrates the convergence behavior difference.
{"data": [{"type": "scatter", "mode": "lines", "name": "L-BFGS", "y": lbfgs_cost_history, "x": list(range(len(lbfgs_cost_history))), "line": {"color": "#1c7ed6"}}, {"type": "scatter", "mode": "lines", "name": "Gradient Descent (\u03b1=0.1)", "y": gd_cost_history, "x": list(range(len(gd_cost_history))), "line": {"color": "#fd7e14"}}], "layout": {"title": "L-BFGS vs Gradient Descent Convergence", "xaxis": {"title": "Iteration"}, "yaxis": {"title": "Cost (Negative Log-Likelihood)", "type": "linear"}, "legend": {"x": 0.6, "y": 0.95}}}
Convergence comparison for Logistic Regression cost minimization. L-BFGS typically requires far fewer iterations to reach a low cost value compared to standard Gradient Descent, even with a reasonably tuned learning rate for GD.
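A minimal sketch to reproduce a chart like the one above with Plotly (assuming the `plotly` package is installed and the cost histories from the runs above are in scope):

```python
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(y=lbfgs_cost_history, mode="lines",
                         name="L-BFGS", line=dict(color="#1c7ed6")))
fig.add_trace(go.Scatter(y=gd_cost_history, mode="lines",
                         name="Gradient Descent (α=0.1)", line=dict(color="#fd7e14")))
fig.update_layout(title="L-BFGS vs Gradient Descent Convergence",
                  xaxis_title="Iteration",
                  yaxis_title="Cost (Negative Log-Likelihood)",
                  legend=dict(x=0.6, y=0.95))
fig.show()
```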
As the plot typically shows, L-BFGS converges much faster in terms of iterations. While each L-BFGS iteration involves more computation (the two-loop recursion and often a more sophisticated line search) than a GD iteration, the total number of gradient evaluations and overall time to reach a desired precision is often significantly lower, especially on problems where gradients are costly to compute or where curvature information provides a strong signal for the search direction.
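To make this concrete, we can compare the work counters from the two runs above (each GD iteration performs one function and one gradient evaluation):

```python
# Compare the amount of work each method did, using the objects defined earlier.
print(f"L-BFGS: {lbfgs_result.nit} iterations, "
      f"{lbfgs_result.nfev} function evals, {lbfgs_result.njev} gradient evals")
print(f"Gradient Descent: {n_iterations_gd} iterations, "
      f"{n_iterations_gd} function evals, {n_iterations_gd} gradient evals")
```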
The main hyperparameter to be aware of is the history size ($m$, or `maxcor` in SciPy): this is the primary tuning parameter. A larger history requires more memory and computation per iteration but can lead to a better Hessian approximation and potentially faster convergence (fewer iterations). Typical values are between 5 and 20; too large a value diminishes the memory savings compared to BFGS. A quick sweep over a few values, shown in the sketch at the end of this section, illustrates the trade-off.

This practical exercise demonstrated how to effectively use L-BFGS, a powerful quasi-Newton method, through standard libraries like SciPy. Its ability to incorporate curvature information without prohibitive memory costs makes it a valuable tool for medium-to-large scale machine learning optimization problems, often providing significantly faster convergence than first-order methods.
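As an optional follow-up to the history-size discussion above, here is a minimal sweep over `maxcor` (illustrative only; iteration counts will vary by problem and machine), reusing the cost function, gradient, and data defined earlier:

```python
# Hypothetical experiment: how the history size affects the iteration count.
for history_size in (5, 10, 20):
    res = minimize(cost_function, initial_theta, args=(X_train, y_train),
                   method='L-BFGS-B', jac=gradient,
                   options={'maxcor': history_size, 'gtol': 1e-6, 'maxiter': 200})
    print(f"maxcor={history_size:2d}: iterations={res.nit}, "
          f"gradient evals={res.njev}, final cost={res.fun:.6f}")
```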