Applying L-BFGS to a concrete optimization problem builds on the concepts behind Newton's method and BFGS. In this section we gain hands-on experience with L-BFGS, focusing on an existing library implementation, the common approach in machine learning workflows, and compare its performance to a simpler method to illustrate practical usage.

Recall that L-BFGS aims to capture the benefits of second-order (curvature) information like BFGS, but avoids the storage and computation burden of maintaining a full $n \times n$ Hessian approximation. It achieves this by storing only a limited history (typically 5 to 20) of the most recent step vectors $s_k = x_{k+1} - x_k$ and gradient difference vectors $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. These vectors implicitly define the Hessian approximation used in the two-loop recursion to calculate the search direction $p_k = -H_k \nabla f(x_k)$, where $H_k$ is the inverse Hessian approximation.

## Setting Up the Optimization Problem

For this practical exercise, let's use a standard machine learning problem: binary logistic regression. The goal is to find the parameters $\theta$ that minimize the negative log-likelihood (or cross-entropy loss) function. Given a dataset $(X, y)$ where $X \in \mathbb{R}^{m \times n}$ ($m$ samples, $n$ features) and $y \in \{0, 1\}^m$, the objective function $J(\theta)$ is:

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] $$

where $h_\theta(x) = \sigma(\theta^T x)$ and $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function.

We also need the gradient $\nabla J(\theta)$:

$$ \nabla J(\theta)_j = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} $$

We'll use a synthetic dataset for simplicity, allowing us to focus on the optimizer's behavior.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt  # Used for basic plotting; Plotly for the final comparison

# Generate synthetic data
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=2, random_state=42)

# Add intercept term
X = np.hstack((np.ones((X.shape[0], 1)), X))
n_features = X.shape[1]

# Split data (optional, good practice but not strictly needed for optimizer demo)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Using the full dataset for simplicity here
X_train, y_train = X, y
m = X_train.shape[0]

# --- Objective Function and Gradient ---

def sigmoid(z):
    # Clip z to avoid overflow/underflow in exp
    z_clipped = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z_clipped))

def cost_function(theta, X, y):
    h = sigmoid(X @ theta)
    # Add small epsilon for numerical stability in log
    epsilon = 1e-7
    cost = -np.mean(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))
    return cost

def gradient(theta, X, y):
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / len(y)
    return grad

# Initial parameters
initial_theta = np.zeros(n_features)

print(f"Dataset shape: {X_train.shape}")
print(f"Initial cost: {cost_function(initial_theta, X_train, y_train):.4f}")
```
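Before handing these functions to an optimizer, it is worth verifying that the analytical gradient agrees with a finite-difference approximation. The check below is not part of the exercise listing; it is a minimal sketch using `scipy.optimize.check_grad`, which returns the norm of the difference between the analytical gradient and a numerical estimate at a given point.

```python
from scipy.optimize import check_grad

# Compare the analytical gradient against a finite-difference estimate
# at a few random points; values near zero (relative to the gradient's
# scale) suggest the gradient implementation matches the cost function.
rng = np.random.default_rng(0)
for _ in range(3):
    theta_test = rng.normal(scale=0.5, size=n_features)
    err = check_grad(cost_function, gradient, theta_test, X_train, y_train)
    print(f"Gradient check error: {err:.2e}")
```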
## Using L-BFGS via SciPy

Most scientific computing libraries provide reliable implementations of L-BFGS. We'll use `scipy.optimize.minimize`, a versatile function that interfaces with various optimization algorithms. The specific variant often used is L-BFGS-B, which allows box constraints (lower and upper bounds on the parameters), though we won't use constraints here.

To use `scipy.optimize.minimize` with L-BFGS, we need to provide:

- `fun`: The objective function to minimize (our `cost_function`).
- `x0`: The initial guess for the parameters (our `initial_theta`).
- `args`: A tuple of additional arguments needed by the objective and gradient functions (our `X_train`, `y_train`).
- `method`: Specify `'L-BFGS-B'`.
- `jac`: The function that computes the gradient vector (our `gradient`). Providing the analytical gradient is much more efficient than relying on numerical approximations.
- `options`: A dictionary of algorithm-specific parameters. Important ones for L-BFGS-B include:
  - `maxcor`: The number of recent $(s, y)$ pairs to store (the history size $m$). Default is 10.
  - `gtol`: Gradient tolerance for convergence. The algorithm stops when $\max_i | \nabla J(\theta)_i | \leq \text{gtol}$.
  - `maxiter`: Maximum number of iterations.

```python
# --- Run L-BFGS ---
print("\nRunning L-BFGS...")

# Store history for plotting
lbfgs_cost_history = []

def callback_fn(theta):
    """Callback function to store cost at each iteration."""
    cost = cost_function(theta, X_train, y_train)
    lbfgs_cost_history.append(cost)
    # Optional: print progress
    # print(f"Iter {len(lbfgs_cost_history)}: Cost = {cost:.6f}")

lbfgs_result = minimize(fun=cost_function,
                        x0=initial_theta,
                        args=(X_train, y_train),
                        method='L-BFGS-B',
                        jac=gradient,
                        callback=callback_fn,
                        options={'maxcor': 15,   # History size m
                                 'gtol': 1e-6,   # Gradient tolerance
                                 'disp': True,   # Display convergence messages
                                 'maxiter': 200})

# --- Results ---
print("\nL-BFGS Results:")
if lbfgs_result.success:
    print(f"Optimization successful: {lbfgs_result.message}")
    print(f"Final cost: {lbfgs_result.fun:.6f}")
    print(f"Number of iterations: {lbfgs_result.nit}")
    print(f"Number of function evaluations: {lbfgs_result.nfev}")
    print(f"Number of gradient evaluations: {lbfgs_result.njev}")
    # optimal_theta_lbfgs = lbfgs_result.x
else:
    print(f"Optimization failed: {lbfgs_result.message}")

# Prepend initial cost for plotting
lbfgs_cost_history.insert(0, cost_function(initial_theta, X_train, y_train))
```

Running this code will output convergence information, the final cost, and the number of iterations and function/gradient evaluations. Notice that L-BFGS typically converges in relatively few iterations compared to first-order methods for well-behaved problems like logistic regression.
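SciPy computes the search direction internally, but it can help to see what the stored `maxcor` pairs are actually used for. Below is a standalone sketch of the two-loop recursion described earlier, which computes $p_k = -H_k \nabla f(x_k)$ from the stored $(s, y)$ pairs; it is an illustration of the idea, not SciPy's internal implementation.

```python
import numpy as np

def two_loop_direction(grad, s_history, y_history):
    """L-BFGS two-loop recursion (sketch): compute p_k = -H_k @ grad
    from stored (s_i, y_i) pairs, ordered oldest to newest."""
    q = grad.copy()
    k = len(s_history)
    rho = [1.0 / (y_history[i] @ s_history[i]) for i in range(k)]
    alpha = np.zeros(k)

    # First loop: newest pair back to oldest
    for i in reversed(range(k)):
        alpha[i] = rho[i] * (s_history[i] @ q)
        q -= alpha[i] * y_history[i]

    # Initial scaling H_0 = gamma * I, a common heuristic
    if k > 0:
        gamma = (s_history[-1] @ y_history[-1]) / (y_history[-1] @ y_history[-1])
    else:
        gamma = 1.0
    r = gamma * q

    # Second loop: oldest pair forward to newest
    for i in range(k):
        beta = rho[i] * (y_history[i] @ r)
        r += (alpha[i] - beta) * s_history[i]

    return -r  # search direction p_k
```

With a history of $m$ stored pairs (SciPy's `maxcor`), each direction computation costs on the order of $m \cdot n$ operations rather than the $n^2$ of full BFGS, which is the entire point of the limited-memory variant.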
## Comparison with Gradient Descent

To appreciate the efficiency of L-BFGS, let's implement a simple gradient descent (GD) algorithm and run it on the same problem. We'll need to choose a learning rate $\alpha$ and run it for a fixed number of iterations or until convergence.

```python
# --- Simple Gradient Descent Implementation ---
def gradient_descent(X, y, initial_theta, learning_rate, n_iterations):
    theta = initial_theta.copy()
    cost_history = []
    for i in range(n_iterations):
        cost = cost_function(theta, X, y)
        cost_history.append(cost)
        grad = gradient(theta, X, y)
        theta = theta - learning_rate * grad
        if i % 100 == 0:  # Print progress occasionally
            print(f"GD Iter {i}: Cost = {cost:.6f}")
    print(f"GD Final Cost after {n_iterations} iterations: {cost_history[-1]:.6f}")
    return theta, cost_history

print("\nRunning Gradient Descent...")
learning_rate = 0.1
n_iterations_gd = 500  # Typically needs significantly more iterations than L-BFGS

# Run GD
gd_theta, gd_cost_history = gradient_descent(X_train, y_train, initial_theta,
                                             learning_rate, n_iterations_gd)
```

## Visualizing Convergence

Now, let's visualize the cost decrease per iteration for both L-BFGS and gradient descent using Plotly. This clearly demonstrates the difference in convergence behavior.

*Figure: "L-BFGS vs Gradient Descent Convergence" (cost, negative log-likelihood, vs. iteration; traces for L-BFGS and Gradient Descent with α = 0.1). Convergence comparison for logistic regression cost minimization: L-BFGS typically requires far fewer iterations to reach a low cost value than standard gradient descent, even with a reasonably tuned learning rate for GD.*

As the plot typically shows, L-BFGS converges much faster in terms of iterations. While each L-BFGS iteration involves more computation (the two-loop recursion and often a more sophisticated line search) than a GD iteration, the total number of gradient evaluations and the overall time to reach a desired precision are often significantly lower, especially on problems where gradients are costly to compute or where curvature information provides a strong signal for the search direction.
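The interactive Plotly figure is not reproduced here, but an equivalent plot can be generated directly from the recorded cost histories. The snippet below is a minimal static sketch using the matplotlib import from the setup code; a Plotly version would carry the same data.

```python
# Plot the two recorded cost histories on the same axes.
# GD runs for many more iterations, so its trace dominates the x-axis.
plt.figure(figsize=(8, 5))
plt.plot(range(len(lbfgs_cost_history)), lbfgs_cost_history,
         label="L-BFGS", color="#1c7ed6")
plt.plot(range(len(gd_cost_history)), gd_cost_history,
         label=f"Gradient Descent (alpha={learning_rate})", color="#fd7e14")
plt.xlabel("Iteration")
plt.ylabel("Cost (Negative Log-Likelihood)")
plt.title("L-BFGS vs Gradient Descent Convergence")
plt.legend()
plt.show()
```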
## Practical Notes for L-BFGS

- **History size (`m` or `maxcor`):** This is the primary tuning parameter. A larger $m$ requires more memory and computation per iteration but can lead to a better Hessian approximation and potentially faster convergence (fewer iterations). Typical values are between 5 and 20. Too large a value diminishes the memory savings compared to BFGS.
- **Line search:** L-BFGS relies on a line search algorithm (such as one satisfying the Wolfe conditions) to determine the step size along the computed search direction $p_k$. Library implementations handle this internally, but its quality affects performance. A poor line search can lead to slow convergence or instability.
- **Gradient accuracy:** L-BFGS is sensitive to the accuracy of the gradient calculation, so ensure your gradient implementation is correct. Using numerical approximations for the gradient is possible but generally much less efficient and can hinder convergence.
- **Scaling:** Like most optimization algorithms, L-BFGS can benefit from feature scaling (e.g., standardization), which improves the conditioning of the problem and makes the loss surface more uniform and easier to navigate (a brief sketch follows the closing paragraph).

This practical exercise demonstrated how to use L-BFGS, a powerful quasi-Newton method, effectively through standard libraries like SciPy. Its ability to incorporate curvature information without prohibitive memory costs makes it a valuable tool for medium-to-large-scale machine learning optimization problems, often providing significantly faster convergence than first-order methods.
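As a closing aside on the scaling note above, here is a minimal sketch (not part of the original exercise) that standardizes the feature columns with scikit-learn's `StandardScaler`, keeping the intercept column of ones untouched, and re-runs the same L-BFGS call on the scaled data.

```python
from sklearn.preprocessing import StandardScaler

# Standardize only the feature columns (column 0 is the intercept of ones).
# On this synthetic data the features are already roughly on a common scale,
# so the effect is modest; on real data with mixed units it can matter more.
scaler = StandardScaler()
X_scaled = X_train.copy()
X_scaled[:, 1:] = scaler.fit_transform(X_train[:, 1:])

scaled_result = minimize(fun=cost_function,
                         x0=np.zeros(n_features),
                         args=(X_scaled, y_train),
                         method='L-BFGS-B',
                         jac=gradient,
                         options={'maxcor': 15, 'gtol': 1e-6})
print(f"Final cost on scaled features: {scaled_result.fun:.6f}, "
      f"iterations: {scaled_result.nit}")
```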