Now that we've explored the theoretical underpinnings of various regularization techniques for Gradient Boosting, let's put them into practice. This hands-on section will guide you through implementing and evaluating the effects of tree constraints, shrinkage, subsampling, and early stopping using Scikit-learn's GradientBoostingClassifier. Our goal is to observe how these techniques mitigate overfitting and improve generalization on unseen data.
First, we need the necessary tools and a dataset susceptible to overfitting. We'll use common Python libraries and generate a synthetic classification dataset using Scikit-learn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.datasets import make_classification
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_redundant=5,
                           n_clusters_per_class=2, flip_y=0.1,
                           random_state=42)
# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
This setup gives us distinct training and validation sets, which are essential for evaluating overfitting and the effectiveness of regularization.
Let's start by training a GBM with parameters that are likely to cause overfitting: a relatively high number of estimators (n_estimators=300), trees deeper than the default (max_depth=5 rather than 3), and a moderate learning_rate of 0.1, with no other explicit regularization constraints.
# Baseline GBM - potentially overfitting
gbm_baseline = GradientBoostingClassifier(n_estimators=300,
                                          learning_rate=0.1,
                                          max_depth=5,  # Reasonably deep trees
                                          random_state=42)
gbm_baseline.fit(X_train, y_train)
# Evaluate performance
y_train_pred_baseline = gbm_baseline.predict(X_train)
y_val_pred_baseline = gbm_baseline.predict(X_val)
y_train_proba_baseline = gbm_baseline.predict_proba(X_train)[:, 1]
y_val_proba_baseline = gbm_baseline.predict_proba(X_val)[:, 1]
print("Baseline Model Performance:")
print(f" Training Accuracy: {accuracy_score(y_train, y_train_pred_baseline):.4f}")
print(f" Validation Accuracy: {accuracy_score(y_val, y_val_pred_baseline):.4f}")
print(f" Training Log Loss: {log_loss(y_train, y_train_proba_baseline):.4f}")
print(f" Validation Log Loss: {log_loss(y_val, y_val_proba_baseline):.4f}")
You'll likely observe a significant gap between the training and validation performance metrics (accuracy and log loss). High training accuracy combined with lower validation accuracy is a classic sign of overfitting. The model has learned the training data too well, including its noise, and fails to generalize.
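You can quantify this gap directly from the metrics above; a one-line sketch:
# Quantify the generalization gap of the baseline model
train_acc = accuracy_score(y_train, y_train_pred_baseline)
val_acc = accuracy_score(y_val, y_val_pred_baseline)
print(f"Train/validation accuracy gap: {train_acc - val_acc:.4f}")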
Now, let's systematically apply the regularization techniques discussed earlier and observe their impact.
Tree Constraints (max_depth, min_samples_leaf)
Controlling the complexity of individual trees is a direct way to prevent them from fitting noise. Let's constrain the max_depth and set a minimum number of samples required per leaf node (min_samples_leaf).
# GBM with Tree Constraints
gbm_tree_reg = GradientBoostingClassifier(n_estimators=300,
                                          learning_rate=0.1,
                                          max_depth=3,  # Shallow trees
                                          min_samples_leaf=10,  # Require more samples per leaf
                                          random_state=42)
gbm_tree_reg.fit(X_train, y_train)
# Evaluate performance
y_train_pred_tree = gbm_tree_reg.predict(X_train)
y_val_pred_tree = gbm_tree_reg.predict(X_val)
y_train_proba_tree = gbm_tree_reg.predict_proba(X_train)[:, 1]
y_val_proba_tree = gbm_tree_reg.predict_proba(X_val)[:, 1]
print("\nGBM with Tree Constraints Performance:")
print(f" Training Accuracy: {accuracy_score(y_train, y_train_pred_tree):.4f}")
print(f" Validation Accuracy: {accuracy_score(y_val, y_val_pred_tree):.4f}")
print(f" Training Log Loss: {log_loss(y_train, y_train_proba_tree):.4f}")
print(f" Validation Log Loss: {log_loss(y_val, y_val_proba_tree):.4f}")
Compare these results to the baseline. You should see the training performance decrease slightly, but the validation performance should improve (or the gap between training and validation performance should narrow), indicating better generalization.
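To see how sensitive this trade-off is to depth alone, a quick sweep over a few illustrative max_depth values helps (the exact numbers will differ from run to run):
# Sweep max_depth to observe the complexity/generalization trade-off
for depth in (2, 3, 5, 8):
    gbm_d = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1,
                                       max_depth=depth, min_samples_leaf=10,
                                       random_state=42)
    gbm_d.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc {accuracy_score(y_train, gbm_d.predict(X_train)):.4f}, "
          f"val acc {accuracy_score(y_val, gbm_d.predict(X_val)):.4f}")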
Shrinkage (learning_rate)
Reducing the learning_rate forces the model to learn more slowly, requiring more boosting rounds (n_estimators) to achieve similar performance, but often producing a model that generalizes better.
# GBM with Shrinkage
# Reduce learning rate, may need more estimators for convergence
gbm_shrinkage = GradientBoostingClassifier(n_estimators=600,  # Increased estimators
                                           learning_rate=0.05,  # Lower learning rate
                                           max_depth=3,  # Keep tree constraints
                                           min_samples_leaf=10,
                                           random_state=42)
gbm_shrinkage.fit(X_train, y_train)
# Evaluate performance
y_train_pred_shrink = gbm_shrinkage.predict(X_train)
y_val_pred_shrink = gbm_shrinkage.predict(X_val)
y_train_proba_shrink = gbm_shrinkage.predict_proba(X_train)[:, 1]
y_val_proba_shrink = gbm_shrinkage.predict_proba(X_val)[:, 1]
print("\nGBM with Shrinkage Performance:")
print(f" Training Accuracy: {accuracy_score(y_train, y_train_pred_shrink):.4f}")
print(f" Validation Accuracy: {accuracy_score(y_val, y_val_pred_shrink):.4f}")
print(f" Training Log Loss: {log_loss(y_train, y_train_proba_shrink):.4f}")
print(f" Validation Log Loss: {log_loss(y_val, y_val_proba_shrink):.4f}")
Again, compare the performance. Lowering the learning rate often yields smoother convergence and better validation results, provided n_estimators is adjusted accordingly.
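To verify this trade-off on your own run, you can compare the staged validation loss for two learning rates; a quick sketch (exact curves will vary):
# Compare staged validation loss for two learning rates (same tree settings)
for lr in (0.1, 0.05):
    gbm_lr = GradientBoostingClassifier(n_estimators=600, learning_rate=lr,
                                        max_depth=3, min_samples_leaf=10,
                                        random_state=42)
    gbm_lr.fit(X_train, y_train)
    losses = [log_loss(y_val, p[:, 1]) for p in gbm_lr.staged_predict_proba(X_val)]
    print(f"learning_rate={lr}: best val log loss {min(losses):.4f} "
          f"at iteration {int(np.argmin(losses)) + 1}")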
Subsampling (subsample, max_features)
Introducing randomness by training each tree on a subset of rows (subsample) or considering only a subset of features for each split (max_features) is characteristic of Stochastic Gradient Boosting.
# GBM with Subsampling
gbm_subsample = GradientBoostingClassifier(n_estimators=600,
                                           learning_rate=0.05,
                                           max_depth=3,
                                           min_samples_leaf=10,
                                           subsample=0.7,  # Use 70% of rows per tree
                                           max_features=0.8,  # Use 80% of features per split
                                           random_state=42)
gbm_subsample.fit(X_train, y_train)
# Evaluate performance
y_train_pred_sub = gbm_subsample.predict(X_train)
y_val_pred_sub = gbm_subsample.predict(X_val)
y_train_proba_sub = gbm_subsample.predict_proba(X_train)[:, 1]
y_val_proba_sub = gbm_subsample.predict_proba(X_val)[:, 1]
print("\nGBM with Subsampling Performance:")
print(f" Training Accuracy: {accuracy_score(y_train, y_train_pred_sub):.4f}")
print(f" Validation Accuracy: {accuracy_score(y_val, y_val_pred_sub):.4f}")
print(f" Training Log Loss: {log_loss(y_train, y_train_proba_sub):.4f}")
print(f" Validation Log Loss: {log_loss(y_val, y_val_proba_sub):.4f}")
Subsampling typically improves robustness, often leading to better validation scores, especially on datasets with high variance or correlated features.
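As a side benefit, whenever subsample is below 1.0, Scikit-learn records the per-stage out-of-bag improvement in the fitted model's oob_improvement_ attribute. A minimal sketch using its cumulative sum as a rough, validation-free hint of where extra trees stop helping (the heuristic here is illustrative):
# Cumulative OOB improvement; the argmax is a rough estimate of a good
# stopping point (exact values vary from run to run)
cum_oob = np.cumsum(gbm_subsample.oob_improvement_)
print(f"OOB-estimated best iteration: {int(np.argmax(cum_oob)) + 1}")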
Early Stopping
Instead of fixing n_estimators in advance, we can monitor the model's performance on a validation set during training and stop when the performance stops improving. Scikit-learn's GBM implements this via the n_iter_no_change, validation_fraction, and tol parameters.
# GBM with Early Stopping
# Use a fraction of training data as internal validation for early stopping
gbm_early_stop = GradientBoostingClassifier(n_estimators=1000,  # Set a high potential max
                                            learning_rate=0.05,
                                            max_depth=3,
                                            min_samples_leaf=10,
                                            subsample=0.7,
                                            max_features=0.8,
                                            validation_fraction=0.2,  # Use 20% of train data for validation
                                            n_iter_no_change=10,  # Stop if no improvement for 10 iterations
                                            tol=0.0001,
                                            random_state=42)
gbm_early_stop.fit(X_train, y_train)
# Evaluate performance (using the actual validation set)
y_train_pred_es = gbm_early_stop.predict(X_train)
y_val_pred_es = gbm_early_stop.predict(X_val)
y_train_proba_es = gbm_early_stop.predict_proba(X_train)[:, 1]
y_val_proba_es = gbm_early_stop.predict_proba(X_val)[:, 1]
print("\nGBM with Early Stopping Performance:")
print(f" Optimal number of estimators found: {gbm_early_stop.n_estimators_}")
print(f" Training Accuracy: {accuracy_score(y_train, y_train_pred_es):.4f}")
print(f" Validation Accuracy: {accuracy_score(y_val, y_val_pred_es):.4f}")
print(f" Training Log Loss: {log_loss(y_train, y_train_proba_es):.4f}")
print(f" Validation Log Loss: {log_loss(y_val, y_val_proba_es):.4f}")
# Alternative: Manually plotting validation error vs. iterations
# Train a model without automatic early stopping
gbm_manual_es = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1, max_depth=3, random_state=42)
gbm_manual_es.fit(X_train, y_train)
# Calculate staged log loss (performance after each iteration)
staged_val_loss = [log_loss(y_val, proba[:, 1]) for proba in gbm_manual_es.staged_predict_proba(X_val)]
staged_train_loss = [log_loss(y_train, proba[:, 1]) for proba in gbm_manual_es.staged_predict_proba(X_train)]
best_iteration = np.argmin(staged_val_loss) + 1 # +1 because iteration count is 1-based
print(f"\nManual Early Stopping Analysis:")
print(f" Lowest validation log loss occurred at iteration: {best_iteration}")
print(f" Validation Log Loss at best iteration: {staged_val_loss[best_iteration-1]:.4f}")
# Visualization of Training vs Validation Loss
iterations = np.arange(len(staged_val_loss)) + 1
{"layout": {"title": "GBM Training vs. Validation Log Loss", "xaxis": {"title": "Number of Boosting Iterations"}, "yaxis": {"title": "Log Loss"}, "legend": {"title": "Dataset"}, "template": "plotly_white"}, "data": [{"name": "Validation Loss", "x": iterations.tolist(), "y": staged_val_loss, "type": "scatter", "mode": "lines", "line": {"color": "#f03e3e"}}, {"name": "Training Loss", "x": iterations.tolist(), "y": staged_train_loss, "type": "scatter", "mode": "lines", "line": {"color": "#1c7ed6"}}, {"name": "Best Iteration", "x": [best_iteration], "y": [staged_val_loss[best_iteration-1]], "type": "scatter", "mode": "markers", "marker": {"color": "#f59f00", "size": 10, "symbol": "star"}, "showlegend": True}]}
Log loss on the training and validation sets as the number of boosting iterations increases. The validation loss typically decreases initially and then starts increasing as the model begins to overfit. Early stopping aims to halt training near the minimum validation loss.
Early stopping automates finding a good value for n_estimators based on validation performance, preventing the model from adding trees once they start hurting generalization. The plot clearly shows the point where validation loss begins to rise, indicating overfitting.
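If you prefer an explicit final model, you can refit at the best iteration found by the manual analysis; a sketch reusing gbm_manual_es's hyperparameters:
# Refit using the iteration count identified by the manual analysis
gbm_final = GradientBoostingClassifier(n_estimators=int(best_iteration),
                                       learning_rate=0.1, max_depth=3,
                                       random_state=42)
gbm_final.fit(X_train, y_train)
print(f"Refit validation log loss: "
      f"{log_loss(y_val, gbm_final.predict_proba(X_val)[:, 1]):.4f}")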
Let's gather the validation accuracy and log loss for each model:
| Regularization Method | Validation Accuracy | Validation Log Loss | Notes |
|---|---|---|---|
| Baseline (Overfitting) | (Value from run) | (Value from run) | High max_depth, no explicit constraints |
| Tree Constraints | (Value from run) | (Value from run) | max_depth=3, min_samples_leaf=10 |
| + Shrinkage | (Value from run) | (Value from run) | Lower learning_rate=0.05, more n_estimators |
| + Subsampling | (Value from run) | (Value from run) | subsample=0.7, max_features=0.8 |
| + Early Stopping (Automatic) | (Value from run) | (Value from run) | Optimal n_estimators found automatically |
(Replace "(Value from run)" with the actual metrics obtained when you execute the code.)
You should observe that applying regularization techniques generally improves validation performance (higher accuracy, lower log loss) compared to the overfitting baseline. Often, a combination of techniques (like constrained trees, shrinkage, subsampling, and early stopping) yields the best results.
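Rather than transcribing numbers by hand, a short sketch can build this comparison table from the fitted models (assuming they are all still in scope):
# Build the comparison table programmatically from the fitted models
models = {"Baseline": gbm_baseline,
          "Tree Constraints": gbm_tree_reg,
          "+ Shrinkage": gbm_shrinkage,
          "+ Subsampling": gbm_subsample,
          "+ Early Stopping": gbm_early_stop}
summary = pd.DataFrame({name: {"Val Accuracy": accuracy_score(y_val, m.predict(X_val)),
                               "Val Log Loss": log_loss(y_val, m.predict_proba(X_val)[:, 1])}
                        for name, m in models.items()}).T
print(summary.round(4))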
Note that Scikit-learn's GradientBoostingClassifier does not directly implement L1/L2 regularization on tree leaf weights the way XGBoost does (which we will cover later). However, the techniques demonstrated here (constraining tree structure, shrinkage, and subsampling) are powerful methods to control model complexity and prevent overfitting within the standard GBM framework.
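One related knob Scikit-learn does expose is ccp_alpha, which applies minimal cost-complexity pruning to each tree; it is not L1/L2 leaf regularization, but it penalizes tree size in a similar spirit. A sketch with a purely illustrative value:
# Cost-complexity pruning of individual trees (alpha chosen for illustration only)
gbm_ccp = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1,
                                     max_depth=5, ccp_alpha=0.001,
                                     random_state=42)
gbm_ccp.fit(X_train, y_train)
print(f"ccp_alpha model validation accuracy: "
      f"{accuracy_score(y_val, gbm_ccp.predict(X_val)):.4f}")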
This practical exercise demonstrates the significant impact of regularization. By thoughtfully applying these techniques, you can build robust Gradient Boosting models that generalize well to new data rather than merely fitting the training set. Experimenting with different values for these parameters is a standard part of the model tuning process, which we will explore in more detail in later chapters.