Applying L-BFGS to a concrete optimization problem builds on the concepts behind Newton's method and BFGS. In this section we gain hands-on experience with L-BFGS, focusing on an existing library implementation, the common approach in machine learning workflows, and compare its performance to a simpler method to illustrate practical usage.

Recall that L-BFGS aims to capture the benefits of second-order (curvature) information like BFGS, but avoids the storage and computation burden of maintaining a full $n \times n$ Hessian approximation. It achieves this by storing only a limited history (typically 5 to 20) of the most recent step vectors $s_k = x_{k+1} - x_k$ and gradient difference vectors $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. These vectors implicitly define the Hessian approximation used in the two-loop recursion to calculate the search direction $p_k = -H_k \nabla f(x_k)$, where $H_k$ is the inverse Hessian approximation.

## Setting Up the Optimization Problem

For this practical exercise, let's use a standard machine learning problem: binary logistic regression. The goal is to find the parameters $\theta$ that minimize the negative log-likelihood (or cross-entropy loss) function. Given a dataset $(X, y)$ where $X \in \mathbb{R}^{m \times n}$ ($m$ samples, $n$ features) and $y \in \{0, 1\}^m$, the objective function $J(\theta)$ is:

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] $$

where $h_\theta(x) = \sigma(\theta^T x)$ and $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function.

We also need the gradient $\nabla J(\theta)$:

$$ \nabla J(\theta)_j = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} $$

We'll use a synthetic dataset for simplicity, allowing us to focus on the optimizer's behavior.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt  # Used for basic plotting; Plotly for the final comparison

# Generate synthetic data
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=2, random_state=42)

# Add intercept term
X = np.hstack((np.ones((X.shape[0], 1)), X))
n_features = X.shape[1]

# Split data (optional, good practice but not strictly needed for optimizer demo)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Using the full dataset for simplicity here
X_train, y_train = X, y
m = X_train.shape[0]

# --- Objective Function and Gradient ---

def sigmoid(z):
    # Clip z to avoid overflow/underflow in exp
    z_clipped = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z_clipped))

def cost_function(theta, X, y):
    h = sigmoid(X @ theta)
    # Add small epsilon for numerical stability in log
    epsilon = 1e-7
    cost = -np.mean(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))
    return cost

def gradient(theta, X, y):
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / len(y)
    return grad

# Initial parameters
initial_theta = np.zeros(n_features)

print(f"Dataset shape: {X_train.shape}")
print(f"Initial cost: {cost_function(initial_theta, X_train, y_train):.4f}")
```
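Before handing these functions to an optimizer, it is worth verifying that the analytical gradient agrees with a finite-difference approximation. The check below is not part of the exercise listing; it is a minimal sketch using `scipy.optimize.check_grad`, which returns the norm of the difference between the analytical gradient and a numerical estimate at a given point.

```python
from scipy.optimize import check_grad

# Compare the analytical gradient against a finite-difference estimate
# at a few random points; values near zero (relative to the gradient's
# scale) suggest the gradient implementation matches the cost function.
rng = np.random.default_rng(0)
for _ in range(3):
    theta_test = rng.normal(scale=0.5, size=n_features)
    err = check_grad(cost_function, gradient, theta_test, X_train, y_train)
    print(f"Gradient check error: {err:.2e}")
```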
## Using L-BFGS via SciPy

Most scientific computing libraries provide reliable implementations of L-BFGS. We'll use `scipy.optimize.minimize`, a versatile function that interfaces with various optimization algorithms. The specific variant often used is L-BFGS-B, which allows box constraints (lower and upper bounds on the parameters), though we won't use constraints here.

To use `scipy.optimize.minimize` with L-BFGS, we need to provide:

- `fun`: The objective function to minimize (our `cost_function`).
- `x0`: The initial guess for the parameters (our `initial_theta`).
- `args`: A tuple of additional arguments needed by the objective and gradient functions (our `X_train`, `y_train`).
- `method`: Specify `'L-BFGS-B'`.
- `jac`: The function that computes the gradient vector (our `gradient`). Providing the analytical gradient is much more efficient than relying on numerical approximations.
- `options`: A dictionary of algorithm-specific parameters. Important ones for L-BFGS-B include:
  - `maxcor`: The number of recent $(s, y)$ pairs to store (the history size $m$). Default is 10.
  - `gtol`: Gradient tolerance for convergence. The algorithm stops when $\max_i | \nabla J(\theta)_i | \leq \text{gtol}$.
  - `maxiter`: Maximum number of iterations.

```python
# --- Run L-BFGS ---
print("\nRunning L-BFGS...")

# Store history for plotting
lbfgs_cost_history = []

def callback_fn(theta):
    """Callback function to store cost at each iteration."""
    cost = cost_function(theta, X_train, y_train)
    lbfgs_cost_history.append(cost)
    # Optional: print progress
    # print(f"Iter {len(lbfgs_cost_history)}: Cost = {cost:.6f}")

lbfgs_result = minimize(fun=cost_function,
                        x0=initial_theta,
                        args=(X_train, y_train),
                        method='L-BFGS-B',
                        jac=gradient,
                        callback=callback_fn,
                        options={'maxcor': 15,   # History size m
                                 'gtol': 1e-6,   # Gradient tolerance
                                 'disp': True,   # Display convergence messages
                                 'maxiter': 200})

# --- Results ---
print("\nL-BFGS Results:")
if lbfgs_result.success:
    print(f"Optimization successful: {lbfgs_result.message}")
    print(f"Final cost: {lbfgs_result.fun:.6f}")
    print(f"Number of iterations: {lbfgs_result.nit}")
    print(f"Number of function evaluations: {lbfgs_result.nfev}")
    print(f"Number of gradient evaluations: {lbfgs_result.njev}")
    # optimal_theta_lbfgs = lbfgs_result.x
else:
    print(f"Optimization failed: {lbfgs_result.message}")

# Prepend initial cost for plotting
lbfgs_cost_history.insert(0, cost_function(initial_theta, X_train, y_train))
```

Running this code will output convergence information, the final cost, and the number of iterations and function/gradient evaluations. Notice that L-BFGS typically converges in relatively few iterations compared to first-order methods for well-behaved problems like logistic regression.
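SciPy computes the search direction internally, but it can help to see what the stored `maxcor` pairs are actually used for. Below is a standalone sketch of the two-loop recursion described earlier, which computes $p_k = -H_k \nabla f(x_k)$ from the stored $(s, y)$ pairs; it is an illustration of the idea, not SciPy's internal implementation.

```python
import numpy as np

def two_loop_direction(grad, s_history, y_history):
    """L-BFGS two-loop recursion (sketch): compute p_k = -H_k @ grad
    from stored (s_i, y_i) pairs, ordered oldest to newest."""
    q = grad.copy()
    k = len(s_history)
    rho = [1.0 / (y_history[i] @ s_history[i]) for i in range(k)]
    alpha = np.zeros(k)

    # First loop: newest pair back to oldest
    for i in reversed(range(k)):
        alpha[i] = rho[i] * (s_history[i] @ q)
        q -= alpha[i] * y_history[i]

    # Initial scaling H_0 = gamma * I, a common heuristic
    if k > 0:
        gamma = (s_history[-1] @ y_history[-1]) / (y_history[-1] @ y_history[-1])
    else:
        gamma = 1.0
    r = gamma * q

    # Second loop: oldest pair forward to newest
    for i in range(k):
        beta = rho[i] * (y_history[i] @ r)
        r += (alpha[i] - beta) * s_history[i]

    return -r  # search direction p_k
```

With a history of $m$ stored pairs (SciPy's `maxcor`), each direction computation costs on the order of $m \cdot n$ operations rather than the $n^2$ of full BFGS, which is the entire point of the limited-memory variant.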
## Comparison with Gradient Descent

To appreciate the efficiency of L-BFGS, let's implement a simple gradient descent (GD) algorithm and run it on the same problem. We'll need to choose a learning rate $\alpha$ and run it for a fixed number of iterations or until convergence.

```python
# --- Simple Gradient Descent Implementation ---
def gradient_descent(X, y, initial_theta, learning_rate, n_iterations):
    theta = initial_theta.copy()
    cost_history = []
    for i in range(n_iterations):
        cost = cost_function(theta, X, y)
        cost_history.append(cost)
        grad = gradient(theta, X, y)
        theta = theta - learning_rate * grad
        if i % 100 == 0:  # Print progress occasionally
            print(f"GD Iter {i}: Cost = {cost:.6f}")
    print(f"GD Final Cost after {n_iterations} iterations: {cost_history[-1]:.6f}")
    return theta, cost_history

print("\nRunning Gradient Descent...")
learning_rate = 0.1
n_iterations_gd = 500  # Typically needs significantly more iterations than L-BFGS

# Run GD
gd_theta, gd_cost_history = gradient_descent(X_train, y_train, initial_theta,
                                             learning_rate, n_iterations_gd)
```

## Visualizing Convergence

Now, let's visualize the cost decrease per iteration for both L-BFGS and gradient descent using Plotly. This clearly demonstrates the difference in convergence behavior.

*Figure: "L-BFGS vs Gradient Descent Convergence" (cost, negative log-likelihood, vs. iteration; traces for L-BFGS and Gradient Descent with α = 0.1). Convergence comparison for logistic regression cost minimization: L-BFGS typically requires far fewer iterations to reach a low cost value than standard gradient descent, even with a reasonably tuned learning rate for GD.*

As the plot typically shows, L-BFGS converges much faster in terms of iterations. While each L-BFGS iteration involves more computation (the two-loop recursion and often a more sophisticated line search) than a GD iteration, the total number of gradient evaluations and the overall time to reach a desired precision are often significantly lower, especially on problems where gradients are costly to compute or where curvature information provides a strong signal for the search direction.
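The interactive Plotly figure is not reproduced here, but an equivalent plot can be generated directly from the recorded cost histories. The snippet below is a minimal static sketch using the matplotlib import from the setup code; a Plotly version would carry the same data.

```python
# Plot the two recorded cost histories on the same axes.
# GD runs for many more iterations, so its trace dominates the x-axis.
plt.figure(figsize=(8, 5))
plt.plot(range(len(lbfgs_cost_history)), lbfgs_cost_history,
         label="L-BFGS", color="#1c7ed6")
plt.plot(range(len(gd_cost_history)), gd_cost_history,
         label=f"Gradient Descent (alpha={learning_rate})", color="#fd7e14")
plt.xlabel("Iteration")
plt.ylabel("Cost (Negative Log-Likelihood)")
plt.title("L-BFGS vs Gradient Descent Convergence")
plt.legend()
plt.show()
```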
## Practical Notes for L-BFGS

- **History size (`m` or `maxcor`):** This is the primary tuning parameter. A larger $m$ requires more memory and computation per iteration but can lead to a better Hessian approximation and potentially faster convergence (fewer iterations). Typical values are between 5 and 20. Too large a value diminishes the memory savings compared to BFGS.
- **Line search:** L-BFGS relies on a line search algorithm (such as one satisfying the Wolfe conditions) to determine the step size along the computed search direction $p_k$. Library implementations handle this internally, but its quality affects performance. A poor line search can lead to slow convergence or instability.
- **Gradient accuracy:** L-BFGS is sensitive to the accuracy of the gradient calculation, so ensure your gradient implementation is correct. Using numerical approximations for the gradient is possible but generally much less efficient and can hinder convergence.
- **Scaling:** Like most optimization algorithms, L-BFGS can benefit from feature scaling (e.g., standardization), which improves the conditioning of the problem and makes the loss surface more uniform and easier to navigate (a brief sketch follows the closing paragraph).

This practical exercise demonstrated how to use L-BFGS, a powerful quasi-Newton method, effectively through standard libraries like SciPy. Its ability to incorporate curvature information without prohibitive memory costs makes it a valuable tool for medium-to-large-scale machine learning optimization problems, often providing significantly faster convergence than first-order methods.
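As a closing aside on the scaling note above, here is a minimal sketch (not part of the original exercise) that standardizes the feature columns with scikit-learn's `StandardScaler`, keeping the intercept column of ones untouched, and re-runs the same L-BFGS call on the scaled data.

```python
from sklearn.preprocessing import StandardScaler

# Standardize only the feature columns (column 0 is the intercept of ones).
# On this synthetic data the features are already roughly on a common scale,
# so the effect is modest; on real data with mixed units it can matter more.
scaler = StandardScaler()
X_scaled = X_train.copy()
X_scaled[:, 1:] = scaler.fit_transform(X_train[:, 1:])

scaled_result = minimize(fun=cost_function,
                         x0=np.zeros(n_features),
                         args=(X_scaled, y_train),
                         method='L-BFGS-B',
                         jac=gradient,
                         options={'maxcor': 15, 'gtol': 1e-6})
print(f"Final cost on scaled features: {scaled_result.fun:.6f}, "
      f"iterations: {scaled_result.nit}")
```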