Once a gradient boosting model is instantiated in Scikit-Learn, training it and using it for predictions follows a familiar, consistent pattern. The process aligns with the library's standard API, which involves two primary methods: .fit() to train the model and .predict() to generate outputs. This simple workflow makes applying the Gradient Boosting Machine (GBM) algorithm straightforward.
The entire process, from data to prediction, can be summarized in a few steps. You begin with your data, prepare it, split it into training and testing sets, and then use these sets to train the model and evaluate its performance.
Figure: The typical machine learning workflow in Scikit-Learn. The model is trained on one subset of the data and used to make predictions on another.
.fit()
The .fit(X, y) method is the workhorse of any Scikit-Learn estimator. When you call this method on a gradient boosting model, you initiate the sequential tree-building process we discussed in the previous chapter.
The X argument is your feature matrix (typically a pandas DataFrame or NumPy array), and y is the target vector (a pandas Series or NumPy array).
Let’s see this in action with GradientBoostingRegressor. Imagine we have a small dataset of property features and want to predict prices.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
# Sample data
data = {
'Area': [2100, 1600, 2400, 1800, 3000, 2200],
'Bedrooms': [3, 3, 3, 2, 4, 3],
'Age': [5, 10, 2, 8, 1, 7],
'Price': [400, 310, 430, 350, 550, 410] # Price in thousands
}
df = pd.DataFrame(data)
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# 1. Instantiate the model
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# 2. Fit the model to the training data
gbr.fit(X_train, y_train)
print("Model training complete.")
When gbr.fit(X_train, y_train) is executed, the model performs the following steps internally:
1. An initial prediction is made; for regression with the default squared-error loss, this is the mean of y_train.
2. The residuals between the current ensemble's predictions and y_train are computed.
3. A shallow decision tree is fit to these residuals.
4. The tree's output, scaled by the learning_rate, is added to the ensemble's predictions.

Steps 2 through 4 continue for n_estimators iterations, with each new tree correcting the errors of the ensemble that came before it, as the sketch below reproduces by hand.
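To make these steps concrete, here is a minimal hand-rolled version of the fitting loop, assuming the default squared-error loss (where the residuals are simply the targets minus the current predictions). The function name manual_gbm_fit and its variables are illustrative, not part of Scikit-Learn's API.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_gbm_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # Step 1: the initial prediction is the mean of the training targets
    initial_prediction = y.mean()
    current_predictions = np.full(len(y), initial_prediction)
    trees = []
    for _ in range(n_estimators):
        # Step 2: residuals of the current ensemble
        residuals = y - current_predictions
        # Step 3: fit a shallow tree to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        tree.fit(X, residuals)
        # Step 4: add the scaled tree output to the ensemble
        current_predictions += learning_rate * tree.predict(X)
        trees.append(tree)
    return initial_prediction, trees

init_pred, trees = manual_gbm_fit(X_train, y_train)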
.predict()
After the model is trained, the .predict(X) method is used to generate predictions on new, unseen data. It takes a feature matrix X with the same structure as the training data and returns an array of predicted values.
Continuing our regression example:
# 3. Generate predictions on the test set
predictions = gbr.predict(X_test)
# Display the test data and the model's predictions
results = X_test.copy()
results['Actual_Price'] = y_test
results['Predicted_Price'] = predictions.round(1)
print(results)
Output:
Area Bedrooms Age Actual_Price Predicted_Price
1 1600 3 10 310 334.8
5 2200 3 7 410 404.7
The .predict() method works by passing the input data through every tree in the ensemble. The final prediction for a given sample is the sum of the initial prediction and the outputs from all subsequent trees, each scaled by the learning rate.
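You can check this decomposition on the fitted model itself: after fitting, Scikit-Learn exposes the initial estimator as gbr.init_ and the individual trees as gbr.estimators_ (a 2D array with one column per output). A short verification sketch, assuming the default squared-error loss:

import numpy as np

# Start from the initial prediction (a DummyRegressor returning the training mean)
manual = gbr.init_.predict(X_test)
# Add each tree's contribution, scaled by the learning rate
for tree in gbr.estimators_[:, 0]:
    manual += gbr.learning_rate * tree.predict(X_test)
# Matches the model's own predictions
print(np.allclose(manual, gbr.predict(X_test)))  # True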
Figure: A comparison of actual and predicted prices. Points closer to the dashed line indicate more accurate predictions.
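To quantify that accuracy, you can score the predictions against y_test with a standard regression metric; a minimal sketch using mean absolute error:

from sklearn.metrics import mean_absolute_error

# Average absolute error, in the same units as Price (thousands)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean absolute error: {mae:.1f} thousand")  # about 15.1 given the output above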
.predict() vs. .predict_proba()
For classification tasks using GradientBoostingClassifier, you have two methods for making predictions, each serving a different purpose.
.predict(X): This method returns the final predicted class label (e.g., 0 or 1, 'Yes' or 'No'). The model calculates the probability of each class and returns the one with the highest probability.
.predict_proba(X): This method returns the probability estimates for each class. For a binary classification problem, it returns a 2D array where each row contains two values: the probability of the negative class and the probability of the positive class.
Let's illustrate with a simple churn prediction example.
from sklearn.ensemble import GradientBoostingClassifier
# Sample classification data
c_data = {
'Tenure': [2, 48, 12, 1, 36, 24],
'MonthlySpend': [50, 100, 80, 45, 110, 95],
'Churn': [1, 0, 0, 1, 0, 1] # 1 for Churn, 0 for No Churn
}
c_df = pd.DataFrame(c_data)
X_c = c_df[['Tenure', 'MonthlySpend']]
y_c = c_df['Churn']
# Instantiate and fit the classifier
gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbc.fit(X_c, y_c)
# New data to predict on
new_customers = pd.DataFrame({'Tenure': [3, 25], 'MonthlySpend': [60, 105]})
# Get class predictions
class_predictions = gbc.predict(new_customers)
print(f"Class Predictions: {class_predictions}")
# Get probability predictions
proba_predictions = gbc.predict_proba(new_customers)
print("Probability Predictions (No Churn, Churn):")
print(proba_predictions.round(3))
Output:
Class Predictions: [1 0]
Probability Predictions (No Churn, Churn):
[[0.435 0.565]
[0.528 0.472]]
Here's how to interpret the output:
- For the first new customer (Tenure: 3, MonthlySpend: 60), the predicted class is 1 (Churn). This is because the probability of churn (0.565) is greater than the probability of no churn (0.435).
- For the second new customer (Tenure: 25, MonthlySpend: 105), the predicted class is 0 (No Churn), as its no-churn probability (0.528) is higher.

The output from .predict_proba() is often more informative than .predict(). It allows you to understand the model's confidence and set custom thresholds for classification. For instance, depending on the business context, you might only classify a customer as "Churn" if the predicted probability is above 0.7, as in the sketch below.
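A minimal sketch of that thresholding idea, using the proba_predictions array from the example above (the 0.7 cutoff is an illustrative business rule, not a library default):

import numpy as np

# Column 1 holds the positive-class (Churn) probabilities
churn_probability = proba_predictions[:, 1]

# Apply a stricter cutoff than the default 0.5
custom_labels = (churn_probability >= 0.7).astype(int)
print(f"Labels at 0.7 threshold: {custom_labels}")  # [0 0]: neither customer passes the stricter bar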