Once a gradient boosting model is instantiated in Scikit-Learn, training it and using it for predictions follows a familiar, consistent pattern. The process aligns with the library's standard API, which involves two primary methods: .fit() to train the model and .predict() to generate outputs. This simple workflow makes applying the Gradient Boosting Machine (GBM) algorithm straightforward.
The entire process, from data to prediction, can be summarized in a few steps. You begin with your data, prepare it, split it into training and testing sets, and then use these sets to train the model and evaluate its performance.
Figure: The typical machine learning workflow in Scikit-Learn. The model is trained on one subset of the data and used to make predictions on another.
.fit()
The .fit(X, y) method is the workhorse of any Scikit-Learn estimator. When you call this method on a gradient boosting model, you initiate the sequential tree-building process we discussed in the previous chapter.
The X argument is your feature matrix (typically a pandas DataFrame or NumPy array), and y is the target vector (a pandas Series or NumPy array).
Let’s see this in action with GradientBoostingRegressor. Imagine we have a small dataset of property features and want to predict prices.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
# Sample data
data = {
'Area': [2100, 1600, 2400, 1800, 3000, 2200],
'Bedrooms': [3, 3, 3, 2, 4, 3],
'Age': [5, 10, 2, 8, 1, 7],
'Price': [400, 310, 430, 350, 550, 410] # Price in thousands
}
df = pd.DataFrame(data)
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# 1. Instantiate the model
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# 2. Fit the model to the training data
gbr.fit(X_train, y_train)
print("Model training complete.")
When gbr.fit(X_train, y_train) is executed, the model performs the following steps internally:
1. An initial prediction is made; for regression with the default squared-error loss, this is the mean of y_train.
2. The residuals between the current ensemble's predictions and y_train are computed.
3. A shallow decision tree is fit to these residuals.
4. The tree's output, scaled by the learning_rate, is added to the ensemble's predictions.

Steps 2 through 4 continue for n_estimators iterations, with each new tree correcting the errors of the ensemble that came before it, as the sketch below reproduces by hand.
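To make these steps concrete, here is a minimal hand-rolled version of the fitting loop, assuming the default squared-error loss (where the residuals are simply the targets minus the current predictions). The function name manual_gbm_fit and its variables are illustrative, not part of Scikit-Learn's API.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_gbm_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # Step 1: the initial prediction is the mean of the training targets
    initial_prediction = y.mean()
    current_predictions = np.full(len(y), initial_prediction)
    trees = []
    for _ in range(n_estimators):
        # Step 2: residuals of the current ensemble
        residuals = y - current_predictions
        # Step 3: fit a shallow tree to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        tree.fit(X, residuals)
        # Step 4: add the scaled tree output to the ensemble
        current_predictions += learning_rate * tree.predict(X)
        trees.append(tree)
    return initial_prediction, trees

init_pred, trees = manual_gbm_fit(X_train, y_train)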
.predict()
After the model is trained, the .predict(X) method is used to generate predictions on new, unseen data. It takes a feature matrix X with the same structure as the training data and returns an array of predicted values.
Continuing our regression example:
# 3. Generate predictions on the test set
predictions = gbr.predict(X_test)
# Display the test data and the model's predictions
results = X_test.copy()
results['Actual_Price'] = y_test
results['Predicted_Price'] = predictions.round(1)
print(results)
Output:
Area Bedrooms Age Actual_Price Predicted_Price
1 1600 3 10 310 334.8
5 2200 3 7 410 404.7
The .predict() method works by passing the input data through every tree in the ensemble. The final prediction for a given sample is the sum of the initial prediction and the outputs from all subsequent trees, each scaled by the learning rate.
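You can check this decomposition on the fitted model itself: after fitting, Scikit-Learn exposes the initial estimator as gbr.init_ and the individual trees as gbr.estimators_ (a 2D array with one column per output). A short verification sketch, assuming the default squared-error loss:

import numpy as np

# Start from the initial prediction (a DummyRegressor returning the training mean)
manual = gbr.init_.predict(X_test)
# Add each tree's contribution, scaled by the learning rate
for tree in gbr.estimators_[:, 0]:
    manual += gbr.learning_rate * tree.predict(X_test)
# Matches the model's own predictions
print(np.allclose(manual, gbr.predict(X_test)))  # True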
Figure: A comparison of actual and predicted prices. Points closer to the dashed line indicate more accurate predictions.
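To quantify that accuracy, you can score the predictions against y_test with a standard regression metric; a minimal sketch using mean absolute error:

from sklearn.metrics import mean_absolute_error

# Average absolute error, in the same units as Price (thousands)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean absolute error: {mae:.1f} thousand")  # about 15.1 given the output above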
.predict() vs. .predict_proba()
For classification tasks using GradientBoostingClassifier, you have two methods for making predictions, each serving a different purpose.
.predict(X): This method returns the final predicted class label (e.g., 0 or 1, 'Yes' or 'No'). The model calculates the probability of each class and returns the one with the highest probability.
.predict_proba(X): This method returns the probability estimates for each class. For a binary classification problem, it returns a 2D array where each row contains two values: the probability of the negative class and the probability of the positive class.
Let's illustrate with a simple churn prediction example.
from sklearn.ensemble import GradientBoostingClassifier
# Sample classification data
c_data = {
'Tenure': [2, 48, 12, 1, 36, 24],
'MonthlySpend': [50, 100, 80, 45, 110, 95],
'Churn': [1, 0, 0, 1, 0, 1] # 1 for Churn, 0 for No Churn
}
c_df = pd.DataFrame(c_data)
X_c = c_df[['Tenure', 'MonthlySpend']]
y_c = c_df['Churn']
# Instantiate and fit the classifier
gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbc.fit(X_c, y_c)
# New data to predict on
new_customers = pd.DataFrame({'Tenure': [3, 25], 'MonthlySpend': [60, 105]})
# Get class predictions
class_predictions = gbc.predict(new_customers)
print(f"Class Predictions: {class_predictions}")
# Get probability predictions
proba_predictions = gbc.predict_proba(new_customers)
print("Probability Predictions (No Churn, Churn):")
print(proba_predictions.round(3))
Output:
Class Predictions: [1 0]
Probability Predictions (No Churn, Churn):
[[0.435 0.565]
[0.528 0.472]]
Here's how to interpret the output:
- For the first new customer (Tenure: 3, MonthlySpend: 60), the predicted class is 1 (Churn). This is because the probability of churn (0.565) is greater than the probability of no churn (0.435).
- For the second new customer (Tenure: 25, MonthlySpend: 105), the predicted class is 0 (No Churn), as its no-churn probability (0.528) is higher.

The output from .predict_proba() is often more informative than .predict(). It allows you to understand the model's confidence and set custom thresholds for classification. For instance, depending on the business context, you might only classify a customer as "Churn" if the predicted probability is above 0.7, as in the sketch below.
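A minimal sketch of that thresholding idea, using the proba_predictions array from the example above (the 0.7 cutoff is an illustrative business rule, not a library default):

import numpy as np

# Column 1 holds the positive-class (Churn) probabilities
churn_probability = proba_predictions[:, 1]

# Apply a stricter cutoff than the default 0.5
custom_labels = (churn_probability >= 0.7).astype(int)
print(f"Labels at 0.7 threshold: {custom_labels}")  # [0 0]: neither customer passes the stricter bar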