While controlling tree structure (depth, minimum samples) and using techniques like shrinkage and subsampling are effective ways to regularize gradient boosting models, another powerful approach involves modifying the core objective function that the algorithm aims to minimize at each step. Instead of solely focusing on minimizing the loss (like squared error or log loss), we can add a penalty term that explicitly discourages model complexity. This method directly incorporates regularization into the optimization process.
The core idea is to augment the standard loss function L(y,F(x)) with a regularization term Ω(ft) that penalizes the complexity of the new tree ft being added at iteration t. The objective function for finding the best tree ft at step t becomes:
Obj^(t) = ∑_{i=1}^{n} L(y_i, F_{t−1}(x_i) + f_t(x_i)) + Ω(f_t)
Here, L is the original loss function, Ft−1(xi) is the prediction from the ensemble built up to iteration t−1, ft(xi) is the prediction of the new tree for sample i, and Ω(ft) is the penalty term for the complexity of that new tree. The goal is now to find the tree ft that minimizes this combined objective.
Two widely used regularization penalties, borrowed from linear models like Lasso and Ridge regression, are the L1 and L2 norms applied to the model parameters. In the context of decision trees within boosting, the "parameters" are typically the output values (weights) assigned to the leaves of the tree. Let T be the number of leaves in the tree ft, and let wj be the output value (weight) assigned to leaf j.
L1 Regularization (Lasso): Adds a penalty proportional to the sum of the absolute values of the leaf weights.
Ω(f_t) = α ∑_{j=1}^{T} |w_j|
The hyperparameter α controls the strength of the L1 penalty. Larger values of α push leaf weights towards zero, potentially making some leaf outputs exactly zero. This encourages sparsity in the leaf values. In XGBoost, this corresponds to the reg_alpha parameter.
L2 Regularization (Ridge): Adds a penalty proportional to the sum of the squared values of the leaf weights.
Ω(f_t) = (λ/2) ∑_{j=1}^{T} w_j^2
The hyperparameter λ controls the strength of the L2 penalty (the factor of 1/2 is included for mathematical convenience during differentiation). Larger values of λ encourage smaller, more evenly distributed leaf weights, preventing any single leaf from having an excessively large output value. This tends to make the model less sensitive to individual data points in a leaf. In XGBoost, this corresponds to the reg_lambda parameter.
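As a concrete illustration, the combined penalty Ω(f_t) can be sketched in a few lines of Python. The function and variable names here are illustrative, not part of any library; `alpha` and `lambda_` mirror XGBoost's reg_alpha and reg_lambda.

```python
def tree_penalty(leaf_weights, alpha=0.0, lambda_=0.0):
    """Omega(f_t) = alpha * sum(|w_j|) + (lambda/2) * sum(w_j^2)."""
    l1 = alpha * sum(abs(w) for w in leaf_weights)
    l2 = 0.5 * lambda_ * sum(w * w for w in leaf_weights)
    return l1 + l2

weights = [0.8, -1.2, 0.3]
print(tree_penalty(weights, alpha=1.0))    # L1 only: 0.8 + 1.2 + 0.3
print(tree_penalty(weights, lambda_=1.0))  # L2 only: 0.5 * (0.64 + 1.44 + 0.09)
```

Note how the L1 term grows linearly with each weight while the L2 term grows quadratically, which is why L1 tends to zero out small weights and L2 tends to shrink large ones.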
XGBoost provides a clear framework for how these penalties are integrated. Instead of working directly with the potentially complex loss function L, XGBoost uses a second-order Taylor approximation of the loss around the prediction Ft−1(xi):
L(y_i, F_{t−1}(x_i) + f_t(x_i)) ≈ L(y_i, F_{t−1}(x_i)) + g_i f_t(x_i) + (1/2) h_i f_t^2(x_i)
where gi is the gradient (first derivative) and hi is the Hessian (second derivative) of the loss function L with respect to the prediction, evaluated at Ft−1(xi).
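These derivatives are straightforward to write down for common losses. A minimal Python sketch (function names are illustrative): for squared error using the L = (1/2)(y − f)^2 convention, g = f − y and h = 1; for binary log loss on a raw score f, with p = sigmoid(f), g = p − y and h = p(1 − p).

```python
import math

def grad_hess_squared_error(y, f):
    """Squared error L = 0.5 * (y - f)**2  ->  g = f - y,  h = 1."""
    return f - y, 1.0

def grad_hess_logloss(y, f):
    """Binary log loss on raw score f; p = sigmoid(f) -> g = p - y, h = p*(1-p)."""
    p = 1.0 / (1.0 + math.exp(-f))
    return p - y, p * (1.0 - p)

print(grad_hess_squared_error(3.0, 2.5))  # (-0.5, 1.0)
print(grad_hess_logloss(1.0, 0.0))        # (-0.5, 0.25)
```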
Substituting this approximation into the objective function (and removing constant terms that don't affect the optimization of ft), the objective for iteration t becomes approximately:
Obj^(t) ≈ ∑_{i=1}^{n} [g_i f_t(x_i) + (1/2) h_i f_t^2(x_i)] + α ∑_{j=1}^{T} |w_j| + (λ/2) ∑_{j=1}^{T} w_j^2
This objective can be rewritten by summing over the leaves j instead of individual samples i. Let Ij be the set of indices of training samples assigned to leaf j. Then ft(xi)=wj for all i∈Ij. The objective becomes:
Obj^(t) ≈ ∑_{j=1}^{T} [(∑_{i∈I_j} g_i) w_j + (1/2)(∑_{i∈I_j} h_i) w_j^2] + α ∑_{j=1}^{T} |w_j| + (λ/2) ∑_{j=1}^{T} w_j^2
Combining terms related to wj:
Obj^(t) ≈ ∑_{j=1}^{T} [(∑_{i∈I_j} g_i) w_j + (1/2)(∑_{i∈I_j} h_i + λ) w_j^2 + α |w_j|]
This regularized objective directly influences two key aspects of tree construction:
Optimal Leaf Weights: For a fixed tree structure, the optimal weight w_j^* for each leaf j is chosen to minimize the bracketed term. With L2 regularization only (α = 0), this minimization has a simple closed-form solution:

w_j^* = −(∑_{i∈I_j} g_i) / (∑_{i∈I_j} h_i + λ)

Notice how the L2 penalty λ appears in the denominator: a larger λ shrinks the optimal weight w_j^* towards zero, reducing the influence of the tree added at this step. When L1 regularization (α > 0) is included, the absolute value term is non-differentiable at zero; in practice this is handled by soft-thresholding the gradient sum ∑_{i∈I_j} g_i by α before applying the same formula, which still yields a closed-form weight. Either way, the principle is the same: the penalty discourages large weights.
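The closed-form weight, including a soft-thresholding step for α > 0 (this thresholding matches how XGBoost handles its L1 term, though the sketch below is illustrative, not the library's actual code):

```python
def optimal_leaf_weight(grads, hess, alpha=0.0, lambda_=0.0):
    """Minimise G*w + 0.5*(H + lambda)*w**2 + alpha*|w| over w,
    where G and H are the sums of gradients and Hessians in the leaf."""
    G, H = sum(grads), sum(hess)
    # Soft-threshold G by alpha (L1); with alpha = 0 this leaves G unchanged.
    if G > alpha:
        G_t = G - alpha
    elif G < -alpha:
        G_t = G + alpha
    else:
        G_t = 0.0
    return -G_t / (H + lambda_)

# A leaf with summed gradient G = 4 and summed Hessian H = 2:
print(optimal_leaf_weight([4.0], [2.0]))               # -2.0: no penalty
print(optimal_leaf_weight([4.0], [2.0], lambda_=2.0))  # -1.0: L2 shrinks the weight
print(optimal_leaf_weight([4.0], [2.0], alpha=5.0))    # 0.0: L1 zeroes it out
```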
Split Finding Gain: When evaluating potential splits during tree construction, algorithms like XGBoost calculate a "gain" associated with the split. This gain measures how much the objective function decreases by replacing a single leaf with two new leaves. The calculation of this gain explicitly incorporates the regularization terms (λ and α). This means that splits leading to excessively large or numerous leaf weights are penalized. The algorithm prefers splits that not only reduce the loss effectively but also result in trees with controlled leaf weights, promoting better generalization.
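A sketch of this gain computation with the L2 term λ, written in the XGBoost style. The `gamma` argument is an assumption beyond the discussion above: it is XGBoost's per-leaf complexity penalty (min_split_loss), included here because it appears in the same gain formula; setting it to 0 recovers the pure L2-regularized gain.

```python
def split_gain(G_left, H_left, G_right, H_right, lambda_=0.0, gamma=0.0):
    """Decrease in the regularised objective from splitting one leaf
    (with gradient/Hessian sums G_left+G_right, H_left+H_right)
    into two children. Larger lambda_ damps every candidate split."""
    def score(G, H):
        return G * G / (H + lambda_)

    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma

# Example: a split separating negative-gradient samples from positive ones.
print(split_gain(-2.0, 1.0, 3.0, 1.0, lambda_=1.0))
```

Because λ sits in every denominator, increasing it lowers the gain of all candidate splits, so fewer splits clear the threshold and the resulting trees are more conservative.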
In practice, the strengths of the L1 (α, reg_alpha) and L2 (λ, reg_lambda) penalties are important hyperparameters that need tuning, typically via cross-validation alongside other parameters like learning rate and tree depth. Other frameworks, such as LightGBM, also expose reg_alpha and reg_lambda parameters implementing similar regularization principles, though the exact internal calculations may differ slightly. Scikit-learn's GradientBoostingRegressor and GradientBoostingClassifier instead rely more heavily on shrinkage, subsampling, and tree constraints for regularization.

By adding L1 and L2 penalties to the objective function, gradient boosting frameworks gain a direct mechanism to control the magnitude of the updates added at each iteration. This is a fundamental technique for preventing overfitting, leading to more robust and generalizable models, especially when combined with other regularization strategies discussed earlier.
© 2025 ApX Machine Learning