In this practical section, we implement and evaluate the effects of tree constraints, shrinkage, subsampling, and early stopping using Scikit-learn's `GradientBoostingClassifier`. The objective is to observe how these techniques mitigate overfitting and improve generalization on unseen data.

## Setting the Stage

First, we need the necessary tools and a dataset susceptible to overfitting. We'll use common Python libraries and generate a synthetic classification dataset with Scikit-learn.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.datasets import make_classification

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_clusters_per_class=2,
                           flip_y=0.1, random_state=42)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
```

This setup gives us distinct training and validation sets, which are essential for evaluating overfitting and the effectiveness of regularization.

## Baseline: An Overfitting Model

Let's start by training a GBM with parameters that are likely to cause overfitting: a relatively high number of estimators (`n_estimators=300`), fairly deep trees (`max_depth=5`), a moderate `learning_rate`, and no other constraints on tree growth.

```python
# Baseline GBM - potentially overfitting
gbm_baseline = GradientBoostingClassifier(n_estimators=300,
                                          learning_rate=0.1,
                                          max_depth=5,  # Reasonably deep trees
                                          random_state=42)
gbm_baseline.fit(X_train, y_train)

# Evaluate performance
y_train_pred_baseline = gbm_baseline.predict(X_train)
y_val_pred_baseline = gbm_baseline.predict(X_val)
y_train_proba_baseline = gbm_baseline.predict_proba(X_train)[:, 1]
y_val_proba_baseline = gbm_baseline.predict_proba(X_val)[:, 1]

print("Baseline Model Performance:")
print(f"  Training Accuracy: {accuracy_score(y_train, y_train_pred_baseline):.4f}")
print(f"  Validation Accuracy: {accuracy_score(y_val, y_val_pred_baseline):.4f}")
print(f"  Training Log Loss: {log_loss(y_train, y_train_proba_baseline):.4f}")
print(f"  Validation Log Loss: {log_loss(y_val, y_val_proba_baseline):.4f}")
```

You'll likely observe a significant gap between the training and validation metrics (accuracy and log loss). High training accuracy combined with noticeably lower validation accuracy is a classic sign of overfitting: the model has learned the training data too well, including its noise, and fails to generalize.

## Applying Regularization Techniques

Now, let's systematically apply the regularization techniques discussed earlier and observe their impact.

### 1. Tree Constraints (`max_depth`, `min_samples_leaf`)

Controlling the complexity of individual trees is a direct way to prevent them from fitting noise.
Let's constrain `max_depth` and set a minimum number of samples required per leaf node (`min_samples_leaf`).

```python
# GBM with tree constraints
gbm_tree_reg = GradientBoostingClassifier(n_estimators=300,
                                          learning_rate=0.1,
                                          max_depth=3,          # Shallow trees
                                          min_samples_leaf=10,  # Require more samples per leaf
                                          random_state=42)
gbm_tree_reg.fit(X_train, y_train)

# Evaluate performance
y_train_pred_tree = gbm_tree_reg.predict(X_train)
y_val_pred_tree = gbm_tree_reg.predict(X_val)
y_train_proba_tree = gbm_tree_reg.predict_proba(X_train)[:, 1]
y_val_proba_tree = gbm_tree_reg.predict_proba(X_val)[:, 1]

print("\nGBM with Tree Constraints Performance:")
print(f"  Training Accuracy: {accuracy_score(y_train, y_train_pred_tree):.4f}")
print(f"  Validation Accuracy: {accuracy_score(y_val, y_val_pred_tree):.4f}")
print(f"  Training Log Loss: {log_loss(y_train, y_train_proba_tree):.4f}")
print(f"  Validation Log Loss: {log_loss(y_val, y_val_proba_tree):.4f}")
```

Compare these results to the baseline. Training performance should decrease slightly, but validation performance should improve (or at least the gap between training and validation should narrow), indicating better generalization.

### 2. Shrinkage (`learning_rate`)

Reducing the `learning_rate` forces the model to learn more slowly, requiring more boosting rounds (`n_estimators`) to achieve similar performance but often resulting in a model that generalizes better.

```python
# GBM with shrinkage
# Reduce learning rate; may need more estimators for convergence
gbm_shrinkage = GradientBoostingClassifier(n_estimators=600,    # Increased estimators
                                           learning_rate=0.05,  # Lower learning rate
                                           max_depth=3,          # Keep tree constraints
                                           min_samples_leaf=10,
                                           random_state=42)
gbm_shrinkage.fit(X_train, y_train)

# Evaluate performance
y_train_pred_shrink = gbm_shrinkage.predict(X_train)
y_val_pred_shrink = gbm_shrinkage.predict(X_val)
y_train_proba_shrink = gbm_shrinkage.predict_proba(X_train)[:, 1]
y_val_proba_shrink = gbm_shrinkage.predict_proba(X_val)[:, 1]

print("\nGBM with Shrinkage Performance:")
print(f"  Training Accuracy: {accuracy_score(y_train, y_train_pred_shrink):.4f}")
print(f"  Validation Accuracy: {accuracy_score(y_val, y_val_pred_shrink):.4f}")
print(f"  Training Log Loss: {log_loss(y_train, y_train_proba_shrink):.4f}")
print(f"  Validation Log Loss: {log_loss(y_val, y_val_proba_shrink):.4f}")
```

Again, compare the performance. Lowering the learning rate often yields smoother convergence and better validation results, provided `n_estimators` is increased accordingly.
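To see this trade-off concretely, here is a small illustrative sketch (not part of the workflow above; the learning-rate values are arbitrary choices) that trains a model at a few learning rates and uses `staged_predict_proba` to find where each one reaches its lowest validation log loss:

```python
# Illustrative comparison: how the learning rate affects where validation loss bottoms out
for lr in [0.2, 0.1, 0.05]:
    gbm = GradientBoostingClassifier(n_estimators=600, learning_rate=lr,
                                     max_depth=3, min_samples_leaf=10,
                                     random_state=42)
    gbm.fit(X_train, y_train)
    # staged_predict_proba yields class probabilities after each boosting iteration
    val_losses = [log_loss(y_val, proba[:, 1])
                  for proba in gbm.staged_predict_proba(X_val)]
    best_iter = int(np.argmin(val_losses)) + 1
    print(f"learning_rate={lr}: best validation log loss "
          f"{min(val_losses):.4f} at iteration {best_iter}")
```

Smaller learning rates usually reach their best validation loss at a later iteration, which is why `n_estimators` has to grow as `learning_rate` shrinks.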
### 3. Subsampling (`subsample`, `max_features`)

Introducing randomness by training each tree on a subset of rows (`subsample`) or by considering only a subset of features for each split (`max_features`) is characteristic of Stochastic Gradient Boosting.

```python
# GBM with subsampling
gbm_subsample = GradientBoostingClassifier(n_estimators=600,
                                           learning_rate=0.05,
                                           max_depth=3,
                                           min_samples_leaf=10,
                                           subsample=0.7,     # Use 70% of rows per tree
                                           max_features=0.8,  # Use 80% of features per split
                                           random_state=42)
gbm_subsample.fit(X_train, y_train)

# Evaluate performance
y_train_pred_sub = gbm_subsample.predict(X_train)
y_val_pred_sub = gbm_subsample.predict(X_val)
y_train_proba_sub = gbm_subsample.predict_proba(X_train)[:, 1]
y_val_proba_sub = gbm_subsample.predict_proba(X_val)[:, 1]

print("\nGBM with Subsampling Performance:")
print(f"  Training Accuracy: {accuracy_score(y_train, y_train_pred_sub):.4f}")
print(f"  Validation Accuracy: {accuracy_score(y_val, y_val_pred_sub):.4f}")
print(f"  Training Log Loss: {log_loss(y_train, y_train_proba_sub):.4f}")
print(f"  Validation Log Loss: {log_loss(y_val, y_val_proba_sub):.4f}")
```

Subsampling typically improves robustness, often leading to better validation scores, especially on datasets with high variance or correlated features.

### 4. Early Stopping

Instead of fixing `n_estimators`, we can monitor the model's performance on a validation set during training and stop when the performance stops improving. Scikit-learn's GBM implements this via the `n_iter_no_change`, `validation_fraction`, and `tol` parameters.

```python
# GBM with early stopping
# Use a fraction of the training data as internal validation for early stopping
gbm_early_stop = GradientBoostingClassifier(n_estimators=1000,        # Set a high potential maximum
                                            learning_rate=0.05,
                                            max_depth=3,
                                            min_samples_leaf=10,
                                            subsample=0.7,
                                            max_features=0.8,
                                            validation_fraction=0.2,  # Use 20% of train data for validation
                                            n_iter_no_change=10,      # Stop if no improvement for 10 iterations
                                            tol=0.0001,
                                            random_state=42)
gbm_early_stop.fit(X_train, y_train)

# Evaluate performance (using the actual held-out validation set)
y_train_pred_es = gbm_early_stop.predict(X_train)
y_val_pred_es = gbm_early_stop.predict(X_val)
y_train_proba_es = gbm_early_stop.predict_proba(X_train)[:, 1]
y_val_proba_es = gbm_early_stop.predict_proba(X_val)[:, 1]

print("\nGBM with Early Stopping Performance:")
print(f"  Optimal number of estimators found: {gbm_early_stop.n_estimators_}")
print(f"  Training Accuracy: {accuracy_score(y_train, y_train_pred_es):.4f}")
print(f"  Validation Accuracy: {accuracy_score(y_val, y_val_pred_es):.4f}")
print(f"  Training Log Loss: {log_loss(y_train, y_train_proba_es):.4f}")
print(f"  Validation Log Loss: {log_loss(y_val, y_val_proba_es):.4f}")
```

As an alternative, we can manually plot the validation error against the number of boosting iterations and locate the best point ourselves:
```python
# Train a model without automatic early stopping
gbm_manual_es = GradientBoostingClassifier(n_estimators=300,
                                           learning_rate=0.1,
                                           max_depth=3,
                                           random_state=42)
gbm_manual_es.fit(X_train, y_train)

# Calculate staged log loss (performance after each boosting iteration)
staged_val_loss = [log_loss(y_val, proba[:, 1])
                   for proba in gbm_manual_es.staged_predict_proba(X_val)]
staged_train_loss = [log_loss(y_train, proba[:, 1])
                     for proba in gbm_manual_es.staged_predict_proba(X_train)]

best_iteration = np.argmin(staged_val_loss) + 1  # +1 because iteration count is 1-based

print("\nManual Early Stopping Analysis:")
print(f"  Lowest validation log loss occurred at iteration: {best_iteration}")
print(f"  Validation Log Loss at best iteration: {staged_val_loss[best_iteration - 1]:.4f}")

# Visualization of training vs. validation loss
iterations = np.arange(len(staged_val_loss)) + 1
plt.plot(iterations, staged_train_loss, label="Training Loss")
plt.plot(iterations, staged_val_loss, label="Validation Loss")
plt.axvline(best_iteration, color="gray", linestyle="--",
            label=f"Best iteration ({best_iteration})")
plt.xlabel("Number of Boosting Iterations")
plt.ylabel("Log Loss")
plt.title("GBM Training vs. Validation Log Loss")
plt.legend()
plt.show()
```
"symbol": "star"}, "showlegend": true}]}Log loss on the training and validation sets as the number of boosting iterations increases. The validation loss typically decreases initially and then starts increasing as the model begins to overfit. Early stopping aims to halt training near the minimum validation loss.Early stopping automates finding a good value for n_estimators based on validation performance, preventing the model from adding trees once they start hurting generalization. The plot clearly shows the point where validation loss begins to rise, indicating overfitting.Comparison and SummaryLet's gather the validation accuracy and log loss for each model:Regularization MethodValidation AccuracyValidation Log LossNotesBaseline (Overfitting)(Value from run)(Value from run)High max_depth, no explicit constraintsTree Constraints(Value from run)(Value from run)max_depth=3, min_samples_leaf=10+ Shrinkage(Value from run)(Value from run)Lower learning_rate=0.05, more n_estimators+ Subsampling(Value from run)(Value from run)subsample=0.7, max_features=0.8+ Early Stopping (Automatic)(Value from run)(Value from run)Optimal n_estimators found automatically(Replace "(Value from run)" with the actual metrics obtained when you execute the code.)You should observe that applying regularization techniques generally improves validation performance (higher accuracy, lower log loss) compared to the overfitting baseline. Often, a combination of techniques (like constrained trees, shrinkage, subsampling, and early stopping) yields the best results.Note that Scikit-learn's GradientBoostingClassifier does not directly implement L1/L2 regularization on the tree leaf weights in the same way as XGBoost (which we will cover later). However, the techniques demonstrated here, constraining tree structure, shrinkage, and subsampling, are powerful methods to control model complexity and prevent overfitting within the standard GBM framework.This practical exercise demonstrates the significant impact of regularization. By thoughtfully applying these techniques, you can build Gradient Boosting models that generalize well to new data, not just fitting the training set effectively. Experimenting with different parameter values for these techniques is a standard part of the model tuning process, which we will discuss in more detail in later chapters.