The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an effective and widely used optimization technique, categorized as a Quasi-Newton method. These methods enhance optimization by approximating the Hessian matrix. Unlike Newton's method, which requires computing and inverting the true Hessian $\nabla^2 f(x)$ at each step, BFGS incrementally builds an approximation to the inverse Hessian, denoted as $H_k$, using only first-order gradient information. This bypasses the significant computational burden associated with the true Hessian.

## The Core Idea: Approximating the Inverse Hessian

The update direction in Newton's method is $p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. BFGS aims to replace the costly inverse Hessian term $[\nabla^2 f(x_k)]^{-1}$ with a matrix $H_k$ that is cheaper to compute and update. The search direction then becomes:

$$ p_k = -H_k \nabla f(x_k) $$

The central challenge is how to update $H_k$ to $H_{k+1}$ at each iteration so that it effectively captures curvature information, similar to the true inverse Hessian.

## The Secant Equation

Quasi-Newton methods, including BFGS, are typically constrained by the secant equation. This equation relates the change in gradient $\nabla f$ to the change in the parameter vector $x$. Let $s_k = x_{k+1} - x_k$ be the step taken and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ be the change in gradients. A second-order Taylor expansion motivates the relationship:

$$ \nabla f(x_{k+1}) - \nabla f(x_k) \approx \nabla^2 f(x_{k+1}) (x_{k+1} - x_k) $$

Rearranging this gives $\nabla^2 f(x_{k+1}) s_k \approx y_k$. If we denote the Hessian approximation at step $k+1$ as $B_{k+1}$ (approximating $\nabla^2 f(x_{k+1})$), the secant equation requires $B_{k+1} s_k = y_k$. Correspondingly, for the inverse Hessian approximation $H_{k+1} = B_{k+1}^{-1}$, the secant equation becomes:

$$ H_{k+1} y_k = s_k $$

This condition ensures that the updated approximation $H_{k+1}$ correctly relates the observed change in gradient $y_k$ to the step $s_k$ that was taken.

## The BFGS Update Formula

The BFGS algorithm provides a specific rank-two update rule for the inverse Hessian approximation $H_k$. Given the current approximation $H_k$, the step $s_k$, and the gradient difference $y_k$, the next approximation $H_{k+1}$ is calculated as:

$$ H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k} $$

Let's dissect this formula:

- **Input:** It uses the previous inverse Hessian approximation $H_k$, the step vector $s_k$, and the gradient difference vector $y_k$.
- **Curvature Condition:** The update requires $y_k^T s_k > 0$. This condition is related to the curvature of the function along the search direction. It ensures that the update maintains the positive definiteness of $H_k$ if the initial $H_0$ was positive definite and an appropriate line search (satisfying the Wolfe conditions) is used. Positive definiteness of $H_k$ guarantees that the search direction $p_k = -H_k \nabla f(x_k)$ is a descent direction.
- **Rank-Two Update:** The update modifies $H_k$ by adding two symmetric rank-one matrices.
- **Properties:** The resulting $H_{k+1}$ is symmetric and satisfies the secant equation ($H_{k+1} y_k = s_k$).

While the formula looks complex, its structure is designed to incorporate the new curvature information contained in the pair $(s_k, y_k)$ while minimally perturbing the previous approximation $H_k$ in other directions.
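To make the update concrete, here is a minimal NumPy sketch of the rank-two formula above. The function name `bfgs_update` is illustrative rather than a standard library API, and the sketch assumes the curvature condition holds.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Apply one BFGS update to the inverse Hessian approximation.

    H : current n x n inverse Hessian approximation H_k
    s : step vector s_k = x_{k+1} - x_k
    y : gradient difference y_k = grad f(x_{k+1}) - grad f(x_k)
    Assumes the curvature condition y^T s > 0 holds.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)           # (I - s y^T / y^T s)
    return V @ H @ V.T + rho * np.outer(s, s)

# Sanity check on random data with y^T s > 0:
# the updated matrix satisfies the secant equation H_{k+1} y = s by construction.
rng = np.random.default_rng(0)
s = rng.standard_normal(4)
y = s + 0.1 * rng.standard_normal(4)
if y @ s > 0:
    H_new = bfgs_update(np.eye(4), s, y)
    print(np.allclose(H_new @ y, s))       # True
```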
## The BFGS Algorithm Steps

The iterative process for BFGS optimization typically proceeds as follows:

1. **Initialization:** Choose a starting point $x_0$. Initialize the inverse Hessian approximation $H_0$; a common choice is the identity matrix, $H_0 = I$. Set the iteration counter $k = 0$.
2. **Compute Search Direction:** Calculate the descent direction $p_k = -H_k \nabla f(x_k)$.
3. **Line Search:** Find an appropriate step size $\alpha_k > 0$ along the direction $p_k$. This is usually done with a line search that satisfies the Wolfe conditions, which helps ensure stability and maintain positive definiteness of the Hessian approximation; backtracking line search is a common, simpler alternative.
4. **Update Position:** Update the parameter vector: $x_{k+1} = x_k + \alpha_k p_k$.
5. **Check Convergence:** Evaluate termination criteria (e.g., $\|\nabla f(x_{k+1})\| < \epsilon$, maximum iterations reached, small change in $x$ or in the function value). If converged, stop.
6. **Compute Update Vectors:** Define $s_k = x_{k+1} - x_k = \alpha_k p_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
7. **Check Curvature Condition:** If $y_k^T s_k > 0$, compute $H_{k+1}$ using the BFGS update formula shown above. Otherwise, skip the update (keep $H_{k+1} = H_k$) or use a different strategy to handle this situation; this is rare if a proper line search is used.
8. **Increment and Loop:** Set $k = k + 1$ and return to step 2.
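The sketch below strings these steps together. It uses a simple backtracking (Armijo) line search instead of a full Wolfe line search for brevity, and the helper names (`bfgs_minimize`, `backtracking_line_search`) are our own; `bfgs_update` is the function sketched earlier. Because backtracking alone does not guarantee the curvature condition, the loop checks $y_k^T s_k > 0$ explicitly and skips the update when it fails.

```python
import numpy as np

def backtracking_line_search(f, x, p, g, alpha=1.0, shrink=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    fx = f(x)
    while f(x + alpha * p) > fx + c * alpha * (g @ p) and alpha > 1e-12:
        alpha *= shrink
    return alpha

def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                      # H_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:         # convergence check on the gradient norm
            break
        p = -H @ g                          # search direction p_k = -H_k grad f(x_k)
        alpha = backtracking_line_search(f, x, p, g)
        x_new = x + alpha * p               # x_{k+1} = x_k + alpha_k p_k
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-10:                   # curvature condition y_k^T s_k > 0
            H = bfgs_update(H, s, y)        # rank-two update from the earlier sketch
        # if the curvature condition fails, the update is skipped (H_{k+1} = H_k)
        x, g = x_new, g_new
    return x

# Example: minimize the convex quadratic f(x) = 0.5 x^T A x - b^T x,
# whose exact minimizer solves A x = b.
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(bfgs_minimize(f, grad, np.zeros(2)))
print(np.linalg.solve(A, b))                # the two results should agree closely
```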
```dot
digraph BFGS_Algorithm {
    rankdir=TB;
    node [shape=box, style=rounded, fontname="Arial", fontsize=10, margin=0.2];
    edge [fontname="Arial", fontsize=9];

    start [label="Initialize x₀, H₀ = I, k = 0", shape=ellipse, style=filled, fillcolor="#e9ecef"];
    compute_grad [label="Compute gradient ∇f(xₖ)"];
    check_term [label="Check Termination\nCriteria Met?", shape=diamond, style=filled, fillcolor="#ffc9c9"];
    compute_dir [label="Compute search direction\npₖ = -Hₖ ∇f(xₖ)"];
    line_search [label="Perform Line Search\nFind step size αₖ\n(e.g., Wolfe conditions)"];
    update_x [label="Update position\nx_{k+1} = xₖ + αₖ pₖ"];
    compute_sy [label="Compute sₖ = x_{k+1} - xₖ\nCompute yₖ = ∇f(x_{k+1}) - ∇f(xₖ)"];
    check_curve [label="Curvature Check\nyₖᵀ sₖ > 0?", shape=diamond, style=filled, fillcolor="#a5d8ff"];
    update_H [label="Update H_{k+1} using BFGS formula"];
    skip_update [label="Skip Update\nH_{k+1} = Hₖ", style=dashed];
    inc_k [label="Increment k = k + 1"];
    stop [label="Stop", shape=ellipse, style=filled, fillcolor="#b2f2bb"];

    start -> compute_grad;
    compute_grad -> check_term;
    check_term -> stop [label="Yes"];
    check_term -> compute_dir [label="No"];
    compute_dir -> line_search;
    line_search -> update_x;
    update_x -> compute_sy;
    compute_sy -> check_curve;
    check_curve -> update_H [label="Yes"];
    check_curve -> skip_update [label="No"];
    update_H -> inc_k;
    skip_update -> inc_k [style=dashed];
    inc_k -> compute_grad;
}
```

A diagram outlining the iterative steps involved in the BFGS optimization algorithm.

## Advantages and Disadvantages

**Advantages:**

- **No Hessian Calculation/Inversion:** Like other Quasi-Newton methods, BFGS avoids the direct computation, storage, and inversion of the $n \times n$ Hessian matrix. It only requires gradient computations.
- **Fast Convergence:** BFGS typically exhibits superlinear convergence rates (faster than linear, slower than quadratic) when close to a minimum, making it much faster than first-order methods like gradient descent for many problems.
- **Robustness:** It is generally considered one of the most effective general-purpose optimization algorithms for smooth, unconstrained problems.

**Disadvantages:**

- **Memory Cost:** The primary drawback of BFGS is the need to store and update the dense $n \times n$ inverse Hessian approximation $H_k$. For high-dimensional problems typical in machine learning (where $n$, the number of parameters, can be millions or billions), storing this matrix becomes infeasible due to its $O(n^2)$ memory requirement.
- **Computational Cost per Iteration:** While cheaper than Newton's method, updating $H_k$ involves matrix-vector and outer-product operations, costing $O(n^2)$ computation per iteration. Computing the search direction $p_k = -H_k \nabla f(x_k)$ also takes $O(n^2)$ operations. This can still be prohibitive for very large $n$.
- **Line Search Requirement:** BFGS relies on a reasonably accurate line search (often satisfying the Wolfe conditions) to guarantee convergence and maintain the positive definiteness of $H_k$.

The significant memory and computational costs associated with the dense $H_k$ matrix in high dimensions directly motivate the development of Limited-memory BFGS (L-BFGS), which we will discuss in the next section. L-BFGS retains many benefits of BFGS while drastically reducing the memory footprint.
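In practice, one would usually call a library implementation rather than writing the loop by hand. As a short usage sketch (assuming SciPy is available), `scipy.optimize.minimize` exposes BFGS directly:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the Rosenbrock test function with SciPy's BFGS implementation.
x0 = np.zeros(5)
result = minimize(rosen, x0, method="BFGS", jac=rosen_der)
print(result.x)     # close to the minimizer at all ones
print(result.nit)   # number of BFGS iterations performed
```

Changing `method` to `"L-BFGS-B"` selects SciPy's limited-memory variant, which connects to the discussion in the next section.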