Among the many settings available in gradient boosting libraries, the number of estimators and the learning rate are perhaps the most influential. These two hyperparameters work together to control the model's convergence and its capacity to fit the training data. Understanding their relationship is fundamental to building effective models.
The n_estimators parameter specifies the total number of sequential trees to be built. This is equivalent to the number of boosting rounds. More estimators mean the model has more opportunities to correct its mistakes, which can lead to lower training error. However, adding too many trees can cause the model to overfit, as it starts to model the noise in the training data rather than the underlying signal.
The learning_rate, often referred to as eta (η), scales the contribution of each tree. After a new tree is trained to fit the residuals of the previous stage, its predictions are multiplied by the learning rate before being added to the overall model. A smaller learning rate shrinks the contribution of each individual tree, requiring more trees to be added to the model.
These two parameters have a strong inverse relationship. A low learning rate requires a high number of estimators to achieve a good fit, while a high learning rate will fit the training data much faster with fewer estimators.
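To make this interaction concrete, here is a minimal sketch of the boosting loop itself, using scikit-learn regression trees as weak learners. The function name, the squared-error setup, and the assumption that X and y are NumPy arrays are illustrative only; library implementations add many refinements, but the roles of n_estimators (the loop length) and learning_rate (the shrinkage factor) are the same.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_ensemble(X, y, n_estimators=100, learning_rate=0.1):
    """Toy gradient boosting for squared error: each round fits a small
    tree to the current residuals, and its prediction is scaled by the
    learning rate before being added to the running model."""
    prediction = np.full(len(y), np.mean(y))   # F_0(x): constant baseline
    trees = []
    for _ in range(n_estimators):              # one boosting round per estimator
        residuals = y - prediction             # what the model still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)  # shrunken update
        trees.append(tree)
    return trees

With a smaller learning_rate, each correction is damped, so more iterations of this loop, and therefore a larger n_estimators, are needed before the residuals shrink. This is the interaction the rest of this section explores.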
Consider an analogy: you are walking down a hill, trying to find the lowest point. The learning rate is your step size. Large steps cover ground quickly but risk striding past the lowest point and never settling into it, while small steps take far longer but let you descend carefully and stop much closer to the bottom.
In gradient boosting, this means a high learning rate can cause the model to converge to a suboptimal solution or overfit rapidly. A lower learning rate leads to a more stable and often better-generalized model, but at the cost of increased computational time, as more trees are needed.
The final model prediction, F_M(x), after M boosting rounds is an accumulation of an initial prediction and the contributions of all subsequent trees, scaled by the learning rate η:

$$F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)$$

This equation shows how η moderates the impact of each new tree, h_m(x), on the final prediction.
The chart below illustrates this dynamic. A model with a higher learning rate (learning_rate=0.3) sees its validation error drop quickly but then start to rise, indicating overfitting. In contrast, the model with a lower learning rate (learning_rate=0.05) learns more slowly but ultimately achieves a better validation score after more boosting rounds.
A lower learning rate requires more estimators to reach its optimal performance but often results in a lower final validation error compared to a higher learning rate.
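Curves like the ones described above can be reproduced with a small experiment. The sketch below trains two classifiers that differ only in learning rate and records the per-round validation log loss; the synthetic dataset from make_classification and the exact sample sizes are assumptions made purely for illustration, and the learning rates match the values discussed above.

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; substitute your own training/validation split.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

histories = {}
for lr in (0.3, 0.05):
    model = xgb.XGBClassifier(n_estimators=500, learning_rate=lr, eval_metric="logloss")
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    # evals_result() holds the validation metric for every boosting round.
    histories[lr] = model.evals_result()["validation_0"]["logloss"]

for lr, curve in histories.items():
    print(f"learning_rate={lr}: best validation logloss {min(curve):.4f} "
          f"at round {curve.index(min(curve)) + 1}")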
Searching for the best combination of n_estimators and learning_rate can be computationally expensive. A common and effective strategy is to use early stopping. This technique monitors the model's performance on a separate validation set during training and stops the process once the validation score stops improving for a specified number of rounds. This automatically finds the optimal number of estimators for a given learning rate.
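Conceptually, early stopping is just a patience counter wrapped around the training loop. The schematic below is not tied to any particular library; evaluate_round is a hypothetical stand-in for adding one more tree to the model and scoring the validation set.

def train_with_early_stopping(evaluate_round, max_rounds=1000, patience=50):
    """Schematic early stopping: stop once the validation score has not
    improved for `patience` consecutive rounds. `evaluate_round(m)` is
    assumed to train round m and return its validation loss."""
    best_loss = float("inf")
    best_round = 0
    for m in range(1, max_rounds + 1):
        loss = evaluate_round(m)
        if loss < best_loss:
            best_loss, best_round = loss, m
        elif m - best_round >= patience:
            break   # no improvement for `patience` rounds in a row
    return best_round, best_loss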
Here is a recommended workflow:
1. Fix the learning rate at a moderate value, such as learning_rate=0.1. This allows for reasonably fast training.
2. Find the optimal n_estimators using early stopping. Train your model with a large number of potential estimators (e.g., n_estimators=1000) but use an early stopping parameter to halt training.

For example, in XGBoost, you can implement this as follows:
import xgboost as xgb

# Assuming X_train, y_train, X_val, y_val are defined
model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    eval_metric='logloss',
    early_stopping_rounds=50
)

model.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print("Best iteration:", model.best_iteration)
The early_stopping_rounds=50 parameter tells the model to stop training if the validation loss (logloss in this case) does not improve for 50 consecutive rounds. The index of the best boosting round is then available in the model.best_iteration attribute.
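If you want a final model that no longer depends on the validation split, one option (a sketch, not the only approach) is to refit on the training data with the round count that early stopping identified. The +1 below reflects that best_iteration is a zero-based index.

# A sketch: lock in the tree count found by early stopping and refit a
# final model that needs no eval_set at training time.
final_model = xgb.XGBClassifier(
    n_estimators=model.best_iteration + 1,  # best_iteration is a zero-based round index
    learning_rate=0.1,
)
final_model.fit(X_train, y_train)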
3. With the learning rate and number of trees fixed, tune the other tree-level hyperparameters, such as max_depth and subsample.
4. Finally, lower the learning rate and scale up the estimators in proportion. For example, try a learning rate of 0.05 with n_estimators around 600, or a learning rate of 0.01 with n_estimators around 3000, again using early stopping to find the new optimal number of trees (a sketch of this step follows below). This final step slowly fine-tunes the model towards a better minimum.

This structured approach is much more efficient than a brute-force grid search over both parameters simultaneously. It allows you to find a high-performing combination of settings by iteratively refining the model's configuration.
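As a sketch of that final refinement step, reusing the X_train, y_train, X_val, y_val split assumed in the earlier example and the illustrative budget of 3000 trees at learning_rate=0.01:

# Sketch of the final step: a lower learning rate with a larger tree budget,
# with early stopping again deciding how many trees are actually kept.
tuned_model = xgb.XGBClassifier(
    n_estimators=3000,           # generous upper bound on boosting rounds
    learning_rate=0.01,          # smaller steps, finer corrections
    eval_metric='logloss',
    early_stopping_rounds=50,
    # plus the max_depth / subsample values chosen in step 3
)
tuned_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Best iteration at learning_rate=0.01:", tuned_model.best_iteration)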