In the gradient boosting framework, the model is built sequentially:
$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$

Here, $F_{m-1}(x)$ is the model after $m-1$ boosting rounds, $h_m(x)$ is the new base learner (typically a decision tree) trained to fit the residual errors (or gradients) from the previous stage, and ν (often denoted `eta` or `learning_rate` in libraries) is the shrinkage parameter, typically a small number between 0 and 1 (e.g., 0.01 or 0.1).
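To make the update rule concrete, here is a minimal sketch of the boosting loop for a squared-error objective, using scikit-learn decision trees as base learners. The function and variable names (`fit_gbm`, `nu`, `n_rounds`) are illustrative and not taken from any particular library.

```python
# Minimal sketch of gradient boosting with shrinkage for squared error.
# Names (fit_gbm, nu, n_rounds) are illustrative, not from a specific library.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X_train, y_train, nu=0.1, n_rounds=100, max_depth=3):
    # F_0: the optimal constant model for squared error is the target mean
    f0 = y_train.mean()
    predictions = np.full(len(y_train), f0)
    trees = []
    for m in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = y_train - predictions
        h_m = DecisionTreeRegressor(max_depth=max_depth).fit(X_train, residuals)
        # Shrinkage: add only a fraction nu of the new tree's correction
        predictions += nu * h_m.predict(X_train)
        trees.append(h_m)
    return f0, trees

def predict_gbm(f0, trees, X, nu=0.1):
    pred = np.full(len(X), f0)
    for h_m in trees:
        pred += nu * h_m.predict(X)
    return pred
```

The only change relative to unshrunk boosting is the factor `nu` multiplying each tree's prediction before it is added to the ensemble.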
At first glance, ν looks identical to the learning rate used in standard gradient descent. Its role here is analogous, but it operates in function space rather than parameter space. Shrinkage scales the contribution of each new tree added to the ensemble: by setting ν < 1, we deliberately slow down the learning process. Instead of allowing each new tree $h_m(x)$ to fully correct the errors of the previous model $F_{m-1}(x)$, we add only a fraction ν of its prediction.
Why is this beneficial for regularization?
Think of it like taking smaller, more cautious steps down the optimization path in function space. Larger steps (high ν) might quickly reduce the training error but risk overshooting the optimal functional fit or fitting noise aggressively. Smaller steps (low ν) proceed more slowly but allow the model to gradually refine its predictions, integrating information from many trees and resulting in a smoother, more generalizable final function.
This regularization effect is considered implicit because shrinkage doesn't add an explicit penalty term to the loss function based on model complexity (as L1/L2 regularization does), nor does it directly constrain the structure of the trees (as setting `max_depth` does). Instead, it modifies the boosting process itself, inherently encouraging solutions that rely on the collaboration of many weak learners.
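To see where these different levers sit in practice, here is an illustrative configuration using xgboost's scikit-learn wrapper. The numeric values are arbitrary placeholders, not recommendations.

```python
# Illustrative hyperparameters for xgboost's scikit-learn API, grouping the
# regularization levers discussed above. Values are arbitrary examples.
from xgboost import XGBRegressor

model = XGBRegressor(
    # Implicit regularization: shrinkage scales each tree's contribution
    learning_rate=0.05,      # the ν above (called eta in xgboost's native parameters)
    n_estimators=1000,       # more rounds are needed when the learning rate is small

    # Explicit penalty terms added to the loss
    reg_lambda=1.0,          # L2 penalty on leaf weights
    reg_alpha=0.0,           # L1 penalty on leaf weights

    # Structural constraint on each tree
    max_depth=4,
)
```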
The following chart illustrates this concept. Notice how a lower learning rate (ν=0.1) leads to slower convergence on the training set but achieves a better (lower) validation error compared to a higher learning rate (ν=0.8), which overfits rapidly.
Training and validation error curves for models trained with high (ν=0.8) and low (ν=0.1) shrinkage rates. The lower rate requires more rounds but results in better validation performance, mitigating overfitting.
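Curves of this kind can be reproduced with scikit-learn's `staged_predict`, which yields the ensemble's predictions after each boosting round. The dataset and settings below are arbitrary choices for illustration.

```python
# Sketch: compare validation error over boosting rounds for high vs. low shrinkage.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for nu in (0.8, 0.1):
    gbm = GradientBoostingRegressor(
        learning_rate=nu, n_estimators=300, max_depth=3, random_state=0
    ).fit(X_tr, y_tr)
    # staged_predict yields predictions after each boosting round
    val_errors = [
        mean_squared_error(y_val, pred) for pred in gbm.staged_predict(X_val)
    ]
    best_round = int(np.argmin(val_errors)) + 1
    print(f"nu={nu}: best validation MSE {min(val_errors):.3f} at round {best_round}")
```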
In practice, shrinkage is almost always used (values far less than 1.0 are typical). It forms a fundamental trade-off with the number of boosting rounds (M). A common strategy is to set ν to a small value (e.g., 0.01 to 0.1) and then determine the optimal M using a validation set, often employing early stopping (discussed later in this chapter). While very small ν values can significantly increase computation time due to the large M required, the resulting generalization improvement often justifies the cost. Shrinkage works synergistically with other regularization techniques like tree constraints and subsampling to build robust gradient boosting models.
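As a sketch of that recipe, the configuration below fixes a small ν, sets a generous upper bound on the number of rounds, and lets scikit-learn's built-in early stopping (`n_iter_no_change` with a held-out `validation_fraction`) choose the effective M. The specific values are illustrative only.

```python
# Sketch: small learning rate, many allowed rounds, early stopping picks M.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

gbm = GradientBoostingRegressor(
    learning_rate=0.05,        # small ν
    n_estimators=5000,         # generous upper bound on the number of rounds M
    max_depth=3,               # tree constraint, complementary regularizer
    subsample=0.8,             # row subsampling, another complementary regularizer
    validation_fraction=0.1,   # held-out fraction used to monitor the loss
    n_iter_no_change=50,       # stop after 50 rounds without improvement
    random_state=0,
).fit(X, y)

print("boosting rounds actually used:", gbm.n_estimators_)
```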