A significant enhancement XGBoost introduces over traditional Gradient Boosting Machines (GBMs) lies in its objective function. While standard GBM minimizes a loss function iteratively, XGBoost incorporates regularization directly into the objective being optimized at each step. This approach provides more direct control over model complexity, helping to prevent overfitting from the ground up.
Recall that gradient boosting builds an ensemble model additively. At step $t$, we add a new tree $f_t(x)$ to the prediction from the previous step, $\hat{y}^{(t-1)}$:

$$\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta f_t(x)$$

where $\eta$ is the learning rate (often called eta or learning_rate in XGBoost). XGBoost aims to find the tree $f_t$ that optimizes the following objective function at step $t$:

$$\text{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k)$$
Here, $l(y_i, \hat{y}_i)$ is the differentiable loss function (e.g., squared error for regression, log loss for classification) measuring the discrepancy between the true label $y_i$ and the prediction $\hat{y}_i$ for the $i$-th training instance. $\Omega(f_k)$ represents a regularization term for the $k$-th tree, penalizing its complexity.
Since the regularization terms for trees $f_1, \dots, f_{t-1}$ are constant at step $t$, we can simplify the objective for finding the best $f_t$:
$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

Optimizing this objective directly for a general tree structure $f_t$ is computationally challenging. XGBoost employs a clever strategy: a second-order Taylor expansion of the loss function around the current prediction $\hat{y}_i^{(t-1)}$.
The Taylor expansion of the loss $l$ around $\hat{y}_i^{(t-1)}$ is:
$$l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) \approx l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2$$

where $g_i$ and $h_i$ are the first- and second-order derivatives (gradient and Hessian) of the loss function $l$ evaluated at the previous prediction $\hat{y}_i^{(t-1)}$:

$$g_i = \left.\frac{\partial\, l(y_i, \hat{y})}{\partial \hat{y}}\right|_{\hat{y} = \hat{y}_i^{(t-1)}} \qquad h_i = \left.\frac{\partial^2 l(y_i, \hat{y})}{\partial \hat{y}^2}\right|_{\hat{y} = \hat{y}_i^{(t-1)}}$$

Using the second-order approximation provides more information about the loss function's shape compared to just using the gradient (as in traditional GBM), often leading to faster convergence and better accuracy, especially for custom loss functions.
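To make the role of $g_i$ and $h_i$ concrete, here is a minimal sketch of a custom objective passed to XGBoost's native xgb.train API, which accepts a callable returning per-instance gradients and Hessians. It assumes the squared error loss $l = \frac{1}{2}(\hat{y} - y)^2$, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the data and function names are purely illustrative.

```python
import numpy as np
import xgboost as xgb

def squared_error_objective(predt: np.ndarray, dtrain: xgb.DMatrix):
    """Custom objective returning (gradient, Hessian) per instance.

    Assumes l(y, y_hat) = 0.5 * (y_hat - y)^2, so:
        g_i = y_hat - y   (first derivative w.r.t. the prediction)
        h_i = 1           (second derivative w.r.t. the prediction)
    """
    y = dtrain.get_label()
    grad = predt - y              # g_i for each training instance
    hess = np.ones_like(predt)    # h_i for each training instance
    return grad, hess

# Toy data; in practice you would use your own features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
dtrain = xgb.DMatrix(X, label=y)

# Passing `obj` hands the per-instance g_i and h_i to the booster,
# which plugs them into the second-order objective described above.
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error_objective)
```

Training with this custom objective should behave much like the built-in squared error objective, since both supply the same gradients and Hessians to the booster.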
Removing the constant term $l(y_i, \hat{y}_i^{(t-1)})$ (which does not affect the optimization of $f_t$), the objective function to minimize at step $t$ becomes:
$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)$$

This approximated objective is now expressed in terms of the gradients $g_i$, the Hessians $h_i$, and the new tree $f_t$ we need to determine.
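As a quick sanity check on the quadratic approximation (not part of XGBoost itself), the snippet below compares the true change in log loss with the second-order estimate $g_i f + \frac{1}{2} h_i f^2$ for a single instance; the label, margin, and step values are made up for illustration.

```python
import numpy as np

def log_loss(y, margin):
    """Binary log loss expressed in terms of the raw margin (log-odds)."""
    p = 1.0 / (1.0 + np.exp(-margin))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, margin = 1.0, 0.3          # true label and current raw prediction
p = 1.0 / (1.0 + np.exp(-margin))
g = p - y                     # gradient of log loss w.r.t. the margin
h = p * (1.0 - p)             # Hessian of log loss w.r.t. the margin

for f in (0.05, 0.2, 0.5):    # candidate contributions f_t(x_i) of the new tree
    true_change = log_loss(y, margin + f) - log_loss(y, margin)
    approx_change = g * f + 0.5 * h * f ** 2
    print(f"f={f:4.2f}  true={true_change:+.5f}  approx={approx_change:+.5f}")
```

For small steps the two values agree closely; the approximation loosens as $f$ grows.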
XGBoost defines the regularization term $\Omega(f_t)$ specifically for decision trees. Let a tree $f_t$ be defined by its structure (which maps an input $x_i$ to a leaf node) and the weights (prediction scores) $w_j$ assigned to each leaf $j$. If the tree has $T$ leaves, the XGBoost regularization term is:
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

This term consists of two parts:

- $\gamma T$ penalizes the number of leaves $T$. Larger values of $\gamma$ (the gamma parameter) make the algorithm more conservative, requiring a larger loss reduction to justify adding a new leaf (i.e., making a split). This acts similarly to pruning but is incorporated directly into the objective, and can be seen as a form of L0 regularization on the number of leaves.
- $\frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$ applies L2 regularization to the leaf weights. Larger values of $\lambda$ (the reg_lambda parameter) shrink the leaf weights towards zero, making the model less sensitive to individual observations and reducing the influence of any single tree.

XGBoost also supports L1 regularization on the weights ($\alpha \sum_j |w_j|$, controlled by the reg_alpha parameter), which can induce sparsity in the weights. The full regularization term can be written as $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2 + \alpha \sum_j |w_j|$. We focus here on the more commonly discussed $\gamma$ and $\lambda$ terms.
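As an illustration of where these penalties appear in practice, the sketch below sets $\gamma$, $\lambda$, and $\alpha$ through the scikit-learn style estimator; the specific values are arbitrary and chosen only for demonstration.

```python
from xgboost import XGBRegressor

# gamma      -> the per-leaf complexity penalty (gamma * T in the objective)
# reg_lambda -> the L2 penalty on leaf weights (lambda in the objective)
# reg_alpha  -> the optional L1 penalty on leaf weights (alpha in the objective)
model = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,   # eta: shrinks each tree's contribution
    max_depth=4,
    gamma=1.0,           # require at least this much gain to keep a split
    reg_lambda=1.0,      # L2 regularization (XGBoost's default is 1)
    reg_alpha=0.0,       # L1 regularization (default 0, i.e., disabled)
)
# model.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```

With the native API, the same penalties go by gamma, lambda, and alpha in the params dictionary.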
Let's incorporate the regularization term into our approximated objective. We define $I_j = \{\, i \mid q(x_i) = j \,\}$ as the set of indices of training instances that fall into leaf $j$, where $q(x)$ represents the tree structure mapping an instance to a leaf index. The prediction $f_t(x_i)$ for an instance $x_i$ in leaf $j$ is simply the leaf weight $w_j$. Substituting this and $\Omega(f_t)$ into the objective:
$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

We can group the summation by the leaves of the tree:
$$\text{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i \right) w_j^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

Rearranging terms:
$$\text{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$

Let $G_j = \sum_{i \in I_j} g_i$ be the sum of gradients and $H_j = \sum_{i \in I_j} h_i$ be the sum of Hessians for the instances in leaf $j$. The objective becomes:
$$\text{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$

Now, for a fixed tree structure $q$ (which determines $T$ and the sets $I_j$), this objective is a sum of independent quadratic functions of the leaf weights $w_j$. We can find the optimal weight $w_j^*$ for each leaf by setting the derivative with respect to $w_j$ to zero:
$$\frac{\partial\, \text{Obj}^{(t)}}{\partial w_j} = G_j + (H_j + \lambda) w_j = 0$$

Solving for $w_j$:
$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

This formula gives the optimal prediction score for each leaf in a given tree structure. It balances the sum of gradients (the desired change) against the sum of Hessians (the curvature of the loss) plus the L2 regularization term $\lambda$. A larger $\lambda$ shrinks the optimal weight towards zero.
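The following sketch computes $G_j$, $H_j$, and the optimal leaf weights for a toy leaf assignment using plain NumPy; the gradient, Hessian, and leaf-index arrays, as well as the value of $\lambda$, are invented for illustration.

```python
import numpy as np

# Per-instance gradients and Hessians (as produced by the loss at step t).
g = np.array([0.8, -0.5, 1.2, -0.3, 0.4, -1.1])
h = np.array([1.0,  1.0, 1.0,  1.0, 1.0,  1.0])

# q(x_i): the leaf index each instance falls into (a tree with T = 2 leaves).
leaf = np.array([0, 0, 0, 1, 1, 1])
lam = 1.0  # L2 regularization strength (lambda)

T = leaf.max() + 1
G = np.bincount(leaf, weights=g, minlength=T)   # G_j = sum of g_i in leaf j
H = np.bincount(leaf, weights=h, minlength=T)   # H_j = sum of h_i in leaf j

w_opt = -G / (H + lam)                          # w_j* = -G_j / (H_j + lambda)
print("G_j:", G, "H_j:", H, "optimal weights:", w_opt)
```

Increasing lam shrinks every entry of w_opt toward zero, mirroring the shrinkage effect described above.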
Substituting $w_j^*$ back into the objective function gives the minimum objective value achieved for that specific tree structure $q$:
$$\text{Obj}^{(t)}(q) = \sum_{j=1}^{T} \left[ G_j \left( -\frac{G_j}{H_j + \lambda} \right) + \frac{1}{2} (H_j + \lambda) \left( -\frac{G_j}{H_j + \lambda} \right)^2 \right] + \gamma T$$

$$\text{Obj}^{(t)}(q) = \sum_{j=1}^{T} \left[ -\frac{G_j^2}{H_j + \lambda} + \frac{1}{2} \cdot \frac{G_j^2}{H_j + \lambda} \right] + \gamma T$$

$$\text{Obj}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

This final expression is extremely important. It provides a score for a tree structure $q$. XGBoost uses this scoring function during tree construction to evaluate the quality of potential splits. The algorithm greedily chooses splits that maximize the reduction in this objective score. The "gain" of a split is calculated as the score of the parent node minus the combined scores of the resulting child nodes, minus the complexity cost $\gamma$ for adding a new leaf.
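To connect the structure score to split evaluation, here is a small NumPy sketch of the resulting gain computation for one candidate split; the sums of gradients and Hessians on each side are chosen purely for illustration.

```python
def structure_score(G, H, lam):
    """Leaf's contribution to the structure score: -1/2 * G^2 / (H + lambda)."""
    return -0.5 * G ** 2 / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Reduction in the objective from splitting one node into two leaves.

    Gain = score(parent) - [score(left) + score(right)] - gamma,
    where gamma is the complexity cost of the extra leaf.
    """
    parent = structure_score(G_left + G_right, H_left + H_right, lam)
    children = (structure_score(G_left, H_left, lam)
                + structure_score(G_right, H_right, lam))
    return parent - children - gamma

# Illustrative gradient/Hessian sums on each side of a candidate split.
gain = split_gain(G_left=1.5, H_left=3.0, G_right=-2.0, H_right=4.0,
                  lam=1.0, gamma=0.5)
print(f"split gain: {gain:.4f}")  # positive gain -> the split is worth keeping
```

A split whose gain comes out negative after subtracting $\gamma$ is not worth making, which is how $\gamma$ acts as a built-in pruning threshold during tree construction.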
The regularized learning objective is a foundation of XGBoost's effectiveness. By incorporating L1 and L2 penalties ($\alpha$, $\lambda$) and a tree complexity penalty ($\gamma$) directly into the objective function, and by using a second-order Taylor expansion involving gradients ($g_i$) and Hessians ($h_i$), XGBoost achieves several advantages: direct control over model complexity to curb overfitting, closed-form optimal leaf weights, a principled built-in criterion for evaluating and pruning splits, and an optimization scheme that often converges faster and extends naturally to custom differentiable loss functions.
This sophisticated objective function sets the stage for the efficient split-finding algorithms and system optimizations that further distinguish XGBoost, which we will explore next.