While the standard Gradient Boosting algorithm iteratively fits new trees to the pseudo-residuals of the previous ensemble, it can be susceptible to overfitting, especially if the individual trees are allowed to grow deep or if many boosting rounds are performed. Just as bagging (Bootstrap Aggregating) introduces randomness to improve the stability and generalization of models like Random Forests, a similar concept can be applied to Gradient Boosting. This variation is often called Stochastic Gradient Boosting (SGB), first introduced by Friedman.
Instead of using the entire training dataset to compute the pseudo-residuals and fit each new base learner (tree), SGB introduces randomness by subsampling the data at each iteration. This modification serves two primary purposes: improving generalization by reducing variance and potentially speeding up computation.
There are two main ways subsampling is commonly implemented in GBM frameworks:
Row subsampling of the training instances is the most common form associated with the term "Stochastic Gradient Boosting". Before fitting each new tree h_m(x), a random fraction of the training instances (rows) is selected without replacement. Let's say the chosen fraction is η_subsample (often controlled by a parameter named subsample in libraries like Scikit-learn). Only this subset of N × η_subsample samples is used to compute the pseudo-residuals and to fit the tree h_m(x).
The pseudo-residuals for the unselected instances are not directly used for fitting tree m.
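To make the mechanics concrete, here is a minimal sketch of a single boosting iteration with row subsampling, assuming a squared-error loss so that the pseudo-residuals reduce to y − F(x). The data and variable names (such as current_pred) are illustrative assumptions, not part of any library API.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

current_pred = np.full_like(y, y.mean())  # F_{m-1}(x): here just the initial constant fit
learning_rate = 0.1
subsample = 0.6  # fraction of rows used for each tree

# Draw this iteration's row subset without replacement
n_sub = int(subsample * len(y))
idx = rng.choice(len(y), size=n_sub, replace=False)

# Pseudo-residuals are computed, and the tree is fit, only on the sampled rows
residuals = y[idx] - current_pred[idx]
tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], residuals)

# The fitted tree still contributes to the predictions for all instances
current_pred += learning_rate * tree.predict(X)
```

In a full implementation this step would repeat for m = 1, ..., M, drawing a fresh random subset at every iteration.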
Impact: Fitting each tree on only a fraction of the data reduces the variance of the final ensemble and lowers the per-iteration training cost, since fewer samples are processed at each step.
A common range for the subsample parameter is between 0.5 and 0.8. Setting subsample=1.0 recovers the original deterministic GBM algorithm for row selection, while any value less than 1.0 introduces stochasticity. Too small a fraction can hinder the learning process, increasing bias or requiring significantly more trees.
Row subsampling in Stochastic Gradient Boosting. At each iteration (m, m+1, ...), a different random subset of training instances is used to compute residuals and fit the next tree.
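In Scikit-learn, row subsampling is enabled simply by setting subsample below 1.0. The sketch below is an illustrative example on synthetic data; the dataset and hyperparameter values are assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=42)

# subsample < 1.0 turns the deterministic GBM into Stochastic Gradient Boosting
sgb = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.7,  # each tree is fit on 70% of the rows, drawn without replacement
    random_state=42,
)

scores = cross_val_score(sgb, X, y, cv=5, scoring="r2")
print(f"Mean CV R^2 with subsample=0.7: {scores.mean():.3f}")
```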
In addition to subsampling rows, we can also subsample features (columns) when constructing each tree. This is analogous to the feature sampling commonly used in Random Forests. Before finding the best split at each node (or sometimes, just once per tree), a random subset of features is considered.
In Scikit-learn's GradientBoostingClassifier and GradientBoostingRegressor, this is controlled by the max_features parameter.
Impact: Restricting the features considered at each split further decorrelates the trees, which improves generalization, and it speeds up split finding because fewer candidate features need to be evaluated.
The max_features parameter can typically be set as:
- an integer, to consider exactly that many features at each split;
- a float, to consider int(max_features * n_features) features;
- a string such as "sqrt" or "log2".
Using max_features=None or max_features=n_features means all features are considered, disabling column subsampling.
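As a brief sketch on an assumed synthetic dataset, column subsampling can be enabled like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic classification data, purely for illustration
X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=200,
    max_features="sqrt",  # roughly sqrt(30) ≈ 5 features considered at each split
    random_state=0,
)
clf.fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.3f}")
```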
Row and column subsampling are not mutually exclusive; they can be used together. Combining them provides a powerful regularization mechanism and can further improve computational efficiency. For instance, setting subsample=0.8 and max_features=0.8 means each tree is built using 80% of the rows and considers 80% of the features when finding splits.
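Here is a hedged example of that combination on synthetic data; the dataset and the remaining hyperparameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data and a held-out split, purely for illustration
X, y = make_classification(n_samples=3000, n_features=25, n_informative=12,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)

model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,      # each tree trains on 80% of the rows
    max_features=0.8,   # each split considers 80% of the features
    random_state=1,
)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```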
These techniques transform the deterministic GBM into a stochastic algorithm. While introducing randomness might seem counter-intuitive in an optimization process, it often leads to models that generalize better to unseen data by preventing overfitting and exploring a more diverse set of weak learners during the boosting process. Both subsample and max_features become important hyperparameters to tune during model development, alongside the learning rate (learning_rate) and tree complexity parameters (max_depth, min_samples_split, etc.). The interplay between these parameters is significant; for example, lower subsampling rates might necessitate more boosting iterations (n_estimators) or adjustments to the learning rate.
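One way to handle this interplay is to tune the stochastic parameters jointly with the parameters they interact with. The randomized search below is a sketch on synthetic data; the value ranges are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=7)

# Search the stochastic parameters together with learning rate, tree depth,
# and the number of boosting iterations, since they influence one another
param_distributions = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.5, 0.6, 0.7, 0.8, 1.0],
    "max_features": [0.5, 0.8, "sqrt", None],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=7),
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
    scoring="accuracy",
    random_state=7,
    n_jobs=-1,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```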