While Newton's method and Quasi-Newton methods like BFGS and L-BFGS determine a search direction (often based on an approximation Bk of the Hessian) and then perform a line search to find an appropriate step size along that direction, Trust Region (TR) methods take a different approach. They define a region around the current iterate xk where they believe the quadratic model mk(p) provides a reliable approximation of the true objective function f(xk+p). They then find the candidate step pk by minimizing this model within the trusted region.
Recall the quadratic model introduced earlier:
mk(p) = f(xk) + ∇f(xk)ᵀp + (1/2)pᵀBkp
Here, Bk is typically the exact Hessian ∇2f(xk) or a Quasi-Newton approximation like the one maintained by BFGS or L-BFGS.
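Evaluating this model at a trial step is straightforward; here is a minimal NumPy sketch (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def quadratic_model(f_k, grad_k, B_k, p):
    """Evaluate m_k(p) = f(x_k) + grad f(x_k)^T p + 0.5 * p^T B_k p."""
    return f_k + grad_k @ p + 0.5 * p @ (B_k @ p)

# At p = 0 the model reproduces the current function value f(x_k).
```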
The Trust Region Subproblem
At each iteration k, the core of a TR method involves solving the following trust region subproblem:
min_{p∈ℝⁿ} mk(p)   subject to   ‖p‖ ≤ Δk
The region ‖p‖ ≤ Δk is the trust region, where Δk > 0 is the trust region radius, and the norm ‖·‖ is usually the Euclidean (L2) norm. This subproblem asks for the step p that minimizes our model mk but stays within a distance Δk from the current point xk.
Think of Δk as a budget for the step size. If the full step suggested by minimizing mk without constraint (the Newton or Quasi-Newton step) falls within this budget, we might take it. If it falls outside, the constraint ‖p‖ ≤ Δk becomes active, and the optimal solution pk will lie on the boundary of the trust region.
Solving this constrained quadratic minimization problem exactly can be complex. However, practical TR algorithms often employ efficient methods to find an approximate solution pk. Popular techniques include:
- Dogleg Method: Particularly useful when Bk is positive definite. It follows a piecewise-linear path (the "dogleg") from the steepest descent (Cauchy) step toward the Newton step, returning the point where this path crosses the trust region boundary (or the full Newton step, if it already lies inside).
- Steihaug-Toint Conjugate Gradient Method: Applies the conjugate gradient algorithm to approximately minimize the quadratic model mk(p), terminating early if an iterate crosses the trust region boundary or a direction of negative curvature is detected. This makes it effective even when Bk is not positive definite.
The choice of subproblem solver influences the overall efficiency and applicability of the TR method.
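To make the dogleg idea concrete, here is a minimal NumPy sketch of a dogleg step, assuming Bk is positive definite (names and structure are illustrative, not from any particular library):

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Approximate solution of the TR subproblem via the dogleg path.

    g: gradient at x_k, B: positive definite Hessian (approximation),
    delta: trust region radius.
    """
    p_newton = -np.linalg.solve(B, g)       # unconstrained model minimizer
    if np.linalg.norm(p_newton) <= delta:
        return p_newton                     # full step fits inside the region
    # Cauchy point: minimizer of the model along the steepest descent direction
    p_cauchy = -(g @ g) / (g @ B @ g) * g
    if np.linalg.norm(p_cauchy) >= delta:
        # Even the Cauchy step leaves the region: clip it to the boundary.
        return delta * p_cauchy / np.linalg.norm(p_cauchy)
    # Otherwise follow the segment from p_cauchy toward p_newton until it
    # crosses the boundary: solve ||p_cauchy + tau * d||^2 = delta^2 for tau.
    d = p_newton - p_cauchy
    a, b = d @ d, 2.0 * (p_cauchy @ d)
    c = p_cauchy @ p_cauchy - delta**2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p_cauchy + tau * d
```

Note that every returned step satisfies ‖p‖ ≤ Δk by construction: the dogleg path starts at the origin, and we stop wherever it meets the boundary.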
Adjusting the Trust Region Radius
A key feature of TR methods is the adaptive adjustment of the radius Δk. This adjustment is based on comparing the actual reduction achieved in the objective function f with the reduction predicted by the model mk.
We calculate the ratio ρk:
ρk = Actual Reduction / Predicted Reduction = (f(xk) − f(xk+pk)) / (mk(0) − mk(pk))
Note that mk(0)=f(xk), so the denominator represents the decrease predicted by the quadratic model when taking the step pk.
The value of ρk tells us how well our quadratic model performed within the current trust region:
- Very Good Agreement (ρk large, e.g., ρk > 0.75): The model was accurate. We can be more ambitious. Accept the step (xk+1 = xk + pk) and increase the trust region radius for the next iteration (e.g., Δk+1 = max(Δk, 2‖pk‖) or Δk+1 = 2Δk).
- Reasonable Agreement (ρk moderate, e.g., 0.1 < ρk ≤ 0.75): The model was somewhat accurate. Accept the step (xk+1 = xk + pk), but be cautious: keep the trust region radius unchanged (Δk+1 = Δk) or shrink it slightly.
- Poor Agreement (ρk small or negative, e.g., ρk≤0.1): The model was a poor predictor in this region, possibly because the step was too large or the curvature changed rapidly. Reject the step (xk+1=xk) and significantly shrink the trust region (Δk+1=0.5Δk or similar). The algorithm will then re-solve the subproblem at iteration k with the smaller Δk+1.
This feedback mechanism allows TR methods to automatically adapt the step size based on the local geometry of the loss function.
Basic logic for updating the trust region radius Δ and accepting/rejecting the step pk based on the agreement ratio ρk.
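A minimal Python sketch of this logic follows; the thresholds (0.1, 0.75) and scaling factors (halving, doubling) mirror the rules above, while the cap `delta_max` is an illustrative assumption:

```python
def tr_update(rho, step_norm, delta, delta_max=100.0):
    """Return (accept, new_delta) given the agreement ratio rho_k.

    rho: agreement ratio, step_norm: ||p_k||, delta: current radius.
    """
    if rho > 0.75:
        # Very good agreement: accept and expand (capped at delta_max).
        return True, min(max(delta, 2.0 * step_norm), delta_max)
    elif rho > 0.1:
        # Reasonable agreement: accept, keep the radius unchanged.
        return True, delta
    else:
        # Poor agreement: reject the step and shrink the region.
        return False, 0.5 * delta
```

For example, `tr_update(0.9, 1.0, 1.0)` accepts the step and doubles the radius, while `tr_update(-0.2, 1.0, 1.0)` rejects it and halves the radius, after which the subproblem is re-solved with the smaller Δ.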
Advantages and Considerations
Trust region methods offer several advantages:
- Robustness: They handle indefinite Hessians (where Newton's method can fail or move towards saddle points/maxima) more gracefully than standard line search methods. The trust region constraint inherently limits step size, preventing divergence.
- Strong Convergence Theory: They possess solid theoretical convergence guarantees, often proving convergence to points satisfying second-order necessary conditions, even for non-convex problems.
- No Line Search Needed: The step size is implicitly controlled by the trust region radius Δk, eliminating the need for a separate line search procedure.
However, there are considerations:
- Subproblem Cost: Solving or accurately approximating the solution to the trust region subproblem at each iteration can be computationally more demanding than the simple direction calculation and line search in methods like L-BFGS.
- Parameter Tuning: The specific thresholds for ρk and the factors used for increasing/decreasing Δk are hyperparameters that might require tuning for optimal performance.
Relevance in Machine Learning
In large-scale deep learning, classic TR methods are less frequently used for direct end-to-end training than adaptive first-order methods (like Adam) or L-BFGS. This is primarily due to the cost of obtaining Hessian information (even in approximate form) and of solving the subproblem for models with potentially millions of parameters.
However, understanding TR methods is valuable for several reasons:
- They provide a different perspective on controlling step sizes using curvature information.
- Techniques inspired by TR principles, such as limiting step norms or using model agreement checks, appear in other advanced optimization contexts.
- Hessian-free optimization methods sometimes incorporate TR ideas to stabilize the solution of the linear systems involved.
- They are often effective for smaller-scale ML problems or specific subproblems within larger algorithms where robustness is highly important.
Trust region methods represent a sophisticated class of optimization algorithms that offer strong theoretical properties and practical robustness by carefully managing the region where model approximations are deemed reliable. While perhaps not the default choice for training massive neural networks today, the underlying principles enrich our understanding of numerical optimization.