Having explored the theoretical underpinnings of the Gradient Boosting Machine algorithm, including functional gradient descent, loss functions, shrinkage, and subsampling, we now turn to its practical implementation using one of Python's primary machine learning libraries, Scikit-learn. Scikit-learn offers robust, well-integrated implementations through its GradientBoostingRegressor and GradientBoostingClassifier classes, encapsulating the core GBM logic within a familiar API.
Scikit-learn provides two main classes for gradient boosting:

- GradientBoostingRegressor: for regression tasks.
- GradientBoostingClassifier: for classification tasks (binary and multiclass).

These classes follow the standard Scikit-learn estimator API, meaning they possess fit, predict, and predict_proba (for classifiers) methods, along with other utility functions. This consistency simplifies their integration into existing machine learning pipelines.
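As a brief illustration of that API consistency, the sketch below drops a GradientBoostingClassifier straight into cross_val_score. The dataset and parameter values here are arbitrary placeholders chosen only for demonstration, not recommendations.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic dataset purely for demonstration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Because the class follows the standard estimator API, it plugs directly
# into utilities such as cross_val_score, GridSearchCV, and Pipeline.
clf = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")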
The theoretical components discussed earlier in this chapter directly correspond to hyperparameters within these Scikit-learn classes. Understanding this mapping is essential for effective model configuration:
- Loss Function (loss): This parameter determines the loss function to be optimized.
  - GradientBoostingRegressor: Common options include squared error (least squares, the default), least absolute deviation, 'huber' (a combination of the two), and 'quantile' (for quantile regression). Recent scikit-learn releases name these 'squared_error' and 'absolute_error'; older releases used the now-removed aliases 'ls' and 'lad'.
  - GradientBoostingClassifier: The primary options are log loss (deviance, suitable for probability estimation, used for both binary and multiclass classification, and the default) and 'exponential' (which essentially reproduces the AdaBoost algorithm). Recent scikit-learn releases name the default 'log_loss'; older releases used 'deviance'.
- Number of Estimators (n_estimators): This controls the number of boosting stages, that is, the number of sequential trees to build (M in our earlier notation). A higher number generally leads to a more complex model, potentially overfitting if not balanced with other regularization techniques. Default is 100.
- Learning Rate (learning_rate): This corresponds to the shrinkage parameter (ν). It scales the contribution of each tree. Lower values require more estimators for comparable performance but often improve generalization; shrinkage acts as a regularization technique by reducing the influence of individual trees. Default is 0.1. There is a typical trade-off: a smaller learning_rate often requires a larger n_estimators, as the staged_predict sketch after the regression example below demonstrates.
- Subsampling (subsample): This parameter enables Stochastic Gradient Boosting by specifying the fraction of samples used to fit each individual base learner (tree). Values below 1.0 introduce randomness, reduce variance, and can improve generalization, often at the cost of a slight increase in bias; they also speed up computation. Default is 1.0 (no subsampling).
- Tree-specific parameters, which constrain the complexity of each base learner:
  - max_depth: Maximum depth of the individual regression estimators. Constraining depth limits model complexity. Default is 3.
  - min_samples_split: The minimum number of samples required to split an internal node. Default is 2.
  - min_samples_leaf: The minimum number of samples required to be at a leaf node. Default is 1.
  - max_features: The number or fraction of features to consider when looking for the best split. This introduces randomness similar to Random Forests and provides column subsampling. Default is None (consider all features).
- Initial Estimator (init): Allows specifying an initial estimator for the starting prediction F0(x). By default, a simple estimator based on the training data is used (e.g., the mean for regression, the log-odds for classification); a custom estimator can be supplied instead, as the sketch after this list shows.
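As a minimal sketch of overriding the default starting prediction, init accepts any estimator exposing fit and predict, or the string 'zero'. The DummyRegressor baseline below is just an illustrative assumption, not a recommended configuration.

from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Start boosting from the training median rather than the default baseline.
# Any estimator with fit/predict (or the string 'zero') is accepted by init.
gbr_custom_init = GradientBoostingRegressor(
    init=DummyRegressor(strategy="median"),
    random_state=0
)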
Let's illustrate how to use GradientBoostingRegressor on a simple synthetic dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import plotly.graph_objects as go
# 1. Generate synthetic data
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = np.sin(X).ravel() + rng.normal(0, 0.5, X.shape[0]) # Target with noise
# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 3. Initialize and train the GBM Regressor
gbr = GradientBoostingRegressor(
    n_estimators=100,        # Number of trees
    learning_rate=0.1,       # Shrinkage
    max_depth=3,             # Max depth of each tree
    subsample=0.8,           # Use 80% of data for each tree
    loss='squared_error',    # Least squares loss ('ls' in older scikit-learn versions)
    random_state=42
)
gbr.fit(X_train, y_train)
# 4. Make predictions
y_pred = gbr.predict(X_test)
# 5. Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.4f}")
# Create a sorted X_test for smoother plotting
X_test_sorted_indices = np.argsort(X_test.ravel())
X_test_sorted = X_test[X_test_sorted_indices]
y_test_sorted = y_test[X_test_sorted_indices]
y_pred_sorted = gbr.predict(X_test_sorted) # Predict on sorted X_test
# 6. Visualize results (Optional)
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_train.ravel(), y=y_train, mode='markers', name='Training Data', marker=dict(color='#a5d8ff', size=8)))
fig.add_trace(go.Scatter(x=X_test_sorted.ravel(), y=y_test_sorted, mode='markers', name='Test Data (Actual)', marker=dict(color='#ffc9c9', size=8)))
fig.add_trace(go.Scatter(x=X_test_sorted.ravel(), y=y_pred_sorted, mode='lines', name='GBM Predictions', line=dict(color='#f03e3e', width=2)))
fig.update_layout(
    title='Gradient Boosting Regressor Fit',
    xaxis_title='Feature X',
    yaxis_title='Target y',
    legend_title='Data',
    template='plotly_white',
    width=700,
    height=400
)
# To display the plot in environments like Jupyter: fig.show()
# Convert to JSON for web display (if needed):
# print(fig.to_json())
Predictions from the trained Gradient Boosting Regressor model compared against the actual test data points and the original training data.
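To make the learning_rate versus n_estimators trade-off concrete, the sketch below reuses the training and test splits from the example above and tracks test error after each boosting stage with staged_predict. The two learning rates and n_estimators=500 are arbitrary illustrative choices.

# Track how test MSE evolves as trees are added, for two learning rates.
for lr in (0.1, 0.01):
    model = GradientBoostingRegressor(
        n_estimators=500, learning_rate=lr, max_depth=3, random_state=42
    )
    model.fit(X_train, y_train)
    # staged_predict yields predictions after each boosting stage
    stage_mse = [
        mean_squared_error(y_test, y_stage)
        for y_stage in model.staged_predict(X_test)
    ]
    best_stage = int(np.argmin(stage_mse)) + 1
    print(f"learning_rate={lr}: best test MSE {min(stage_mse):.4f} at {best_stage} trees")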
The process for classification is analogous, using GradientBoostingClassifier.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, log_loss
import plotly.graph_objects as go
import pandas as pd
# 1. Generate synthetic classification data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0,
n_clusters_per_class=1, random_state=42, class_sep=1.0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 2. Initialize and train the GBM Classifier
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=2,             # Shallower trees often work well for classification
    subsample=0.8,
    loss='log_loss',         # Log loss for probability outputs ('deviance' in older versions)
    random_state=42
)
gbc.fit(X_train, y_train)
# 3. Make predictions
y_pred = gbc.predict(X_test)
y_pred_proba = gbc.predict_proba(X_test)[:, 1] # Probabilities for the positive class
# 4. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
logloss = log_loss(y_test, y_pred_proba)
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Test Log Loss: {logloss:.4f}")
# 5. Feature Importances
importances = gbc.feature_importances_
feature_names = [f'Feature {i}' for i in range(X.shape[1])]
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print("\nFeature Importances:")
print(importance_df)
# Create Plotly bar chart for feature importances
fig_imp = go.Figure(go.Bar(
    x=importance_df['Importance'],
    y=importance_df['Feature'],
    orientation='h',
    marker_color='#3bc9db'
))
fig_imp.update_layout(
    title='GBM Feature Importances',
    xaxis_title='Importance Score',
    yaxis_title='Feature',
    yaxis={'categoryorder': 'total ascending'},  # Show most important at top
    template='plotly_white',
    width=600,
    height=300
)
# To display the plot: fig_imp.show()
# print(fig_imp.to_json())
Feature importances derived from the trained Gradient Boosting Classifier, indicating the relative contribution of each feature to the model's predictions based on impurity reduction.
As demonstrated, the trained GBM model provides a feature_importances_ attribute. These importances are typically calculated based on the total reduction in the loss function (or an impurity criterion like Friedman MSE) brought about by splits on that feature across all trees in the ensemble, weighted by the number of samples affected. While useful for a quick assessment of feature relevance, remember that these importance scores can sometimes be misleading, especially with correlated features or when comparing features of different types or scales. Later chapters will introduce more advanced interpretation methods like SHAP values.
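One quick cross-check of the impurity-based scores is permutation importance, sketched below using gbc, X_test, y_test, and feature_names from the classification example above; the number of repeats is an arbitrary choice for illustration.

from sklearn.inspection import permutation_importance

# Permutation importance measures the drop in a score (here accuracy)
# when a feature's values are randomly shuffled on the held-out test set.
perm = permutation_importance(gbc, X_test, y_test, n_repeats=20, random_state=42)
for name, mean, std in zip(feature_names, perm.importances_mean, perm.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")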
Scikit-learn's GradientBoostingRegressor and GradientBoostingClassifier provide solid, foundational implementations of the algorithm discussed in this chapter. They are excellent tools for many problems and serve as a stepping stone to understanding more complex boosting libraries. However, for tasks demanding higher performance, speed optimizations, or specialized features like advanced handling of categorical data or missing values, libraries such as XGBoost, LightGBM, and CatBoost (which we will explore in subsequent chapters) often offer significant advantages. Having mastered the core GBM mechanics and its Scikit-learn implementation, you are well-prepared to appreciate the enhancements these specialized libraries bring.