While parameters like max_depth and min_child_weight control the complexity of individual trees, another effective method for regularization is to introduce randomness into the tree-building process itself. This technique, known as stochastic gradient boosting, involves training each new learner on a different subsample of the data. By doing so, we reduce the model's variance and its tendency to overfit, much like how bagging operates in algorithms like Random Forests.
This randomness is injected in two primary ways: by sampling the training instances (rows) and by sampling the features (columns).
Row subsampling trains each tree on a fraction of the total training data. This is controlled by the subsample hyperparameter in Scikit-Learn and XGBoost, or by bagging_fraction in LightGBM (which only takes effect when bagging_freq is set to a positive value).
At the beginning of each boosting iteration, a random subset of the training data is selected without replacement. The next tree in the sequence is then fit to the errors computed on this subset only. This prevents any single tree from being overly influenced by specific training instances, such as outliers, that might otherwise dominate the gradient calculations.
A typical value for subsample is between 0.5 and 0.8.
- subsample=1.0 uses all training data for every tree, which is equivalent to standard gradient boosting without this form of regularization.
- subsample=0.7 trains each tree on a randomly selected 70% of the training data.

Reducing the subsample value increases the randomness and provides a stronger regularization effect. However, setting it too low may lead to underfitting because each tree is built on too little information. It can also increase training time, as the model may require more trees (n_estimators) to converge.
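To see the same setting outside XGBoost, here is a minimal sketch using Scikit-Learn's GradientBoostingRegressor and LightGBM's scikit-learn wrapper. The values shown are illustrative, and the training data (X_train, y_train) is assumed to exist already:

from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgb

# Scikit-Learn: each tree sees a random 70% of the rows
sk_model = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,
    subsample=0.7,
    random_state=42
)

# LightGBM (scikit-learn API): subsample / subsample_freq correspond to the
# native bagging_fraction / bagging_freq; row bagging only takes effect when
# the frequency is greater than zero
lgb_model = lgb.LGBMRegressor(
    n_estimators=300,
    learning_rate=0.05,
    subsample=0.7,
    subsample_freq=1,
    random_state=42
)

# sk_model.fit(X_train, y_train)
# lgb_model.fit(X_train, y_train)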
In addition to sampling rows, we can also sample columns (features) when building each tree. This technique is especially effective for datasets with many features, where some may be highly correlated. It forces the model to consider a wider variety of features instead of repeatedly relying on a few dominant ones.
The most common hyperparameter for this is colsample_bytree. This parameter specifies the fraction of columns to be randomly sampled when constructing each tree. For example, if colsample_bytree is set to 0.8 in a dataset with 100 features, each new tree will be built using a randomly chosen subset of 80 features.
Other related parameters offer more granular control:
- colsample_bylevel: controls the fraction of columns sampled at each new depth level within a tree.
- colsample_bynode: controls the fraction of columns sampled at each split (node) within a tree.

In XGBoost these ratios apply cumulatively, so colsample_bytree=0.8 combined with colsample_bynode=0.8 leaves each split with roughly 64% of the original features. Using colsample_bytree encourages the model to build a more diverse set of weak learners, which improves the overall robustness of the final ensemble.
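As a brief sketch of how these finer-grained options are set (the values here are illustrative, not recommendations), an XGBoost regressor can combine all three:

import xgboost as xgb

# Illustrative values: 80% of features sampled per tree, then 80% of that
# subset per depth level, then 80% again per split; the ratios multiply
xgb_col_sampled = xgb.XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    colsample_bytree=0.8,
    colsample_bylevel=0.8,
    colsample_bynode=0.8,
    random_state=42
)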
The diagram below illustrates how both row and column subsampling work together. For each new tree, a different subset of both rows and columns is used for training.
Each tree in the boosting sequence is trained on a unique, randomly sampled subset of the original data's rows and columns.
In practice, both row and column subsampling are used together. Setting both subsample and colsample_bytree to values less than 1.0 is a standard approach to regularize XGBoost, LightGBM, and other gradient boosting models.
Here is how you would set these parameters when initializing an XGBoost model:
import xgboost as xgb
# Initialize an XGBoost regressor with subsampling
xgb_model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,          # Use 80% of the rows to train each tree
    colsample_bytree=0.8,   # Use 80% of the features to train each tree
    random_state=42
)
# You would then fit this model to your data
# xgb_model.fit(X_train, y_train)
Finding the optimal values for these parameters depends on the dataset. A common strategy is to start with values between 0.7 and 0.9 and use a search technique like Grid Search or Randomized Search, which we will cover next, to find the combination that yields the best performance on a validation set. By introducing this structured randomness, you can build models that are not only powerful but also generalize well to unseen data.
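As a brief preview of that workflow (the search ranges, n_iter, and scoring choice below are illustrative assumptions), a randomized search over these two parameters could look like this:

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Sample subsample and colsample_bytree uniformly from [0.6, 0.95]
param_distributions = {
    "subsample": uniform(0.6, 0.35),
    "colsample_bytree": uniform(0.6, 0.35)
}

search = RandomizedSearchCV(
    estimator=xgb.XGBRegressor(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=4,
        random_state=42
    ),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42
)

# search.fit(X_train, y_train)
# print(search.best_params_)

After fitting, best_params_ reports the sampled combination that performed best across the cross-validation folds.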