In this section we apply L-BFGS to a concrete optimization problem, building on the concepts from Newton's method and BFGS. We work hands-on with L-BFGS through an existing library implementation, the common approach in machine learning workflows, and compare its performance against a simpler first-order method to illustrate practical usage.
Recall that L-BFGS aims to capture the benefits of second-order information (curvature) like BFGS, but avoids the storage and computation burden of maintaining a full Hessian approximation. It achieves this by storing only a limited history (typically 5 to 20 pairs) of the most recent step vectors $s_k = x_{k+1} - x_k$ and gradient difference vectors $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. These vectors implicitly define the Hessian approximation used in the two-loop recursion to calculate the search direction $p_k = -H_k \nabla f(x_k)$, where $H_k$ is the inverse Hessian approximation.
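To make the two-loop recursion concrete, here is a minimal sketch of the standard algorithm. It is for illustration only (SciPy's internal implementation differs); the function name lbfgs_direction, the lists s_list and y_list, and the scaling choice for the initial matrix are illustrative conventions, not part of any library API.

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: returns p = -H * grad using only the stored (s, y) pairs."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: iterate from the most recent pair to the oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * np.dot(s, q)
        q = q - alpha * y
        alphas.append(alpha)
    # Scale by gamma = s^T y / y^T y, a common choice for the initial H_k^0
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: iterate from the oldest pair back to the most recent
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r = r + (alpha - beta) * s
    return -r  # search direction p_k = -H_k * grad_k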
For this practical exercise, let's use a standard machine learning problem: Binary Logistic Regression. The goal is to find the parameters $\theta$ that minimize the negative log-likelihood (or cross-entropy loss) function. Given a dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ where $x^{(i)} \in \mathbb{R}^n$ (m samples, n features) and $y^{(i)} \in \{0, 1\}$, the objective function is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$$

where $h_\theta(x) = \sigma(\theta^T x)$ and $\sigma$ is the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$.

We also need the gradient $\nabla J(\theta)$:

$$\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)} = \frac{1}{m} X^T \left(h_\theta(X) - y\right)$$
We'll use a synthetic dataset for simplicity, allowing us to focus on the optimizer's behavior.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt # Using for basic plotting, Plotly for final comparison
# Generate synthetic data
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=2,
                           n_classes=2, random_state=42)
# Add intercept term
X = np.hstack((np.ones((X.shape[0], 1)), X))
n_features = X.shape[1]
# Split data (optional, good practice but not strictly needed for optimizer demo)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Using full dataset for simplicity here
X_train, y_train = X, y
m = X_train.shape[0]
# --- Objective Function and Gradient ---
def sigmoid(z):
    # Clip z to avoid overflow/underflow in exp
    z_clipped = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z_clipped))

def cost_function(theta, X, y):
    h = sigmoid(X @ theta)
    # Add small epsilon for numerical stability in log
    epsilon = 1e-7
    cost = -np.mean(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))
    return cost

def gradient(theta, X, y):
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / len(y)
    return grad
# Initial parameters
initial_theta = np.zeros(n_features)
print(f"Dataset shape: {X_train.shape}")
print(f"Initial cost: {cost_function(initial_theta, X_train, y_train):.4f}")
Most scientific computing libraries provide reliable implementations of L-BFGS. We'll use scipy.optimize.minimize, a versatile function that interfaces with various optimization algorithms. The specific variant often used is L-BFGS-B, which allows box constraints (lower and upper bounds on parameters), though we won't use constraints here.
To use scipy.optimize.minimize with L-BFGS, we need to provide:
fun: The objective function to minimize (our cost_function).
x0: The initial guess for the parameters (our initial_theta).
args: A tuple of additional arguments needed by the objective and gradient functions (our X_train, y_train).
method: Specify 'L-BFGS-B'.
jac: The function that computes the gradient vector (our gradient). Providing the analytical gradient is much more efficient than relying on numerical approximations.
options: A dictionary for algorithm-specific parameters. Important ones for L-BFGS-B include:
maxcor: The number of recent (s, y) pairs to store (the history size m). Default is 10.
gtol: Gradient norm tolerance for convergence. The algorithm stops when $\max_i |\text{proj } g_i| \le \text{gtol}$, the infinity norm of the projected gradient (which equals the plain gradient here, since we impose no bounds).
maxiter: Maximum number of iterations.
# --- Run L-BFGS ---
print("\nRunning L-BFGS...")
# Store history for plotting
lbfgs_cost_history = []
def callback_fn(theta):
    """Callback function to store cost at each iteration."""
    cost = cost_function(theta, X_train, y_train)
    lbfgs_cost_history.append(cost)
    # Optional: print progress
    # print(f"Iter {len(lbfgs_cost_history)}: Cost = {cost:.6f}")

lbfgs_result = minimize(fun=cost_function,
                        x0=initial_theta,
                        args=(X_train, y_train),
                        method='L-BFGS-B',
                        jac=gradient,
                        callback=callback_fn,
                        options={'maxcor': 15,   # History size m
                                 'gtol': 1e-6,   # Gradient tolerance
                                 'disp': True,   # Display convergence messages
                                 'maxiter': 200
                                 }
                        )
# --- Results ---
print("\nL-BFGS Results:")
if lbfgs_result.success:
    print(f"Optimization successful: {lbfgs_result.message}")
    print(f"Final cost: {lbfgs_result.fun:.6f}")
    print(f"Number of iterations: {lbfgs_result.nit}")
    print(f"Number of function evaluations: {lbfgs_result.nfev}")
    print(f"Number of gradient evaluations: {lbfgs_result.njev}")
    # optimal_theta_lbfgs = lbfgs_result.x
else:
    print(f"Optimization failed: {lbfgs_result.message}")
# Prepend initial cost for plotting
lbfgs_cost_history.insert(0, cost_function(initial_theta, X_train, y_train))
Running this code will output convergence information, the final cost, and the number of iterations and function/gradient evaluations. Notice that L-BFGS typically converges in relatively few iterations compared to first-order methods for well-behaved problems like logistic regression.
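To see why supplying jac matters, you can rerun the same problem without it; SciPy then approximates the gradient with finite differences, which costs extra function evaluations at every iteration. A sketch of that comparison (exact counts will vary by run and SciPy version):

# --- Optional: same problem without an analytical gradient ---
# SciPy falls back to finite-difference gradients, so nfev grows substantially.
lbfgs_numgrad = minimize(fun=cost_function,
                         x0=initial_theta,
                         args=(X_train, y_train),
                         method='L-BFGS-B',
                         options={'maxcor': 15, 'gtol': 1e-6, 'maxiter': 200})
print(f"Function evaluations with analytical jac:     {lbfgs_result.nfev}")
print(f"Function evaluations with finite differences: {lbfgs_numgrad.nfev}")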
To appreciate the efficiency of L-BFGS, let's implement a simple Gradient Descent (GD) algorithm and run it on the same problem. We'll need to choose a learning rate and run it for a fixed number of iterations or until convergence.
# --- Simple Gradient Descent Implementation ---
def gradient_descent(X, y, initial_theta, learning_rate, n_iterations):
    theta = initial_theta.copy()
    cost_history = []
    for i in range(n_iterations):
        cost = cost_function(theta, X, y)
        cost_history.append(cost)
        grad = gradient(theta, X, y)
        theta = theta - learning_rate * grad
        if i % 100 == 0:  # Print progress occasionally
            print(f"GD Iter {i}: Cost = {cost:.6f}")
    print(f"GD Final Cost after {n_iterations} iterations: {cost_history[-1]:.6f}")
    return theta, cost_history
print("\nRunning Gradient Descent...")
learning_rate = 0.1
n_iterations_gd = 500  # GD typically needs many more iterations than L-BFGS
# Run GD
gd_theta, gd_cost_history = gradient_descent(X_train, y_train, initial_theta, learning_rate, n_iterations_gd)
Now, let's visualize the cost decrease per iteration for both L-BFGS and Gradient Descent using Plotly. This clearly demonstrates the convergence behavior difference.
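A minimal sketch of the comparison plot using plotly.graph_objects (the styling choices here, including the log-scaled y-axis that makes the tail of the curves easier to read, are illustrative):

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(y=lbfgs_cost_history, mode='lines+markers', name='L-BFGS'))
fig.add_trace(go.Scatter(y=gd_cost_history, mode='lines', name='Gradient Descent'))
fig.update_layout(title='Cost per Iteration: L-BFGS vs. Gradient Descent',
                  xaxis_title='Iteration',
                  yaxis_title='Cost (negative log-likelihood)',
                  yaxis_type='log')
fig.show()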
Convergence comparison for Logistic Regression cost minimization. L-BFGS typically requires far fewer iterations to reach a low cost value compared to standard Gradient Descent, even with a reasonably tuned learning rate for GD.
As the plot typically shows, L-BFGS converges much faster in terms of iterations. While each L-BFGS iteration involves more computation (the two-loop recursion and often a more sophisticated line search) than a GD iteration, the total number of gradient evaluations and overall time to reach a desired precision is often significantly lower, especially on problems where gradients are costly to compute or where curvature information provides a strong signal for the search direction.
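A rough wall-clock comparison makes the "overall time" point concrete. Timings depend on hardware, the BLAS backend, and problem size, so treat the following as a sketch rather than a benchmark:

import time

# Time a fresh L-BFGS run (without the callback, to keep its overhead out of the timing)
start = time.perf_counter()
minimize(fun=cost_function, x0=initial_theta, args=(X_train, y_train),
         method='L-BFGS-B', jac=gradient, options={'gtol': 1e-6, 'maxiter': 200})
lbfgs_seconds = time.perf_counter() - start

# Time the plain gradient descent loop defined earlier
start = time.perf_counter()
gradient_descent(X_train, y_train, initial_theta, learning_rate, n_iterations_gd)
gd_seconds = time.perf_counter() - start

print(f"L-BFGS wall-clock time:           {lbfgs_seconds:.3f} s")
print(f"Gradient Descent wall-clock time: {gd_seconds:.3f} s")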
History size (m or maxcor): This is the primary tuning parameter. A larger m requires more memory and computation per iteration but can lead to a better Hessian approximation and potentially faster convergence (fewer iterations). Typical values are between 5 and 20; too large a value diminishes the memory savings compared to BFGS.

This practical exercise demonstrated how to use L-BFGS, a powerful quasi-Newton method, effectively through standard libraries like SciPy. Its ability to incorporate curvature information without prohibitive memory costs makes it a valuable tool for medium-to-large scale machine learning optimization problems, often providing significantly faster convergence than first-order methods.