While gradient boosting models often achieve high predictive accuracy according to metrics like AUC or F1-score, their raw output scores for classification tasks might not directly correspond to true probabilities. For instance, a model predicting a score of 0.9 for a class doesn't necessarily mean there's a 90% objective probability that the instance belongs to that class. The optimization process, particularly with loss functions like log loss, pushes scores towards 0 and 1 to improve discrimination, but this can distort the probabilistic interpretation. When the actual probability estimates are important for decision-making, risk assessment, or downstream tasks like model stacking, calibrating these outputs becomes a necessary step.
Probability calibration is the process of transforming the raw output scores of a classifier into probabilities that better reflect the true likelihood of outcomes. A perfectly calibrated binary classifier has the property that, among the instances where it predicts a probability p, the actual fraction of positive instances is indeed close to p.
A common way to visualize calibration is through a reliability diagram (also known as a calibration curve). This plot bins the predicted probabilities (e.g., 0-0.1, 0.1-0.2, ..., 0.9-1.0) and, for each bin, plots the average predicted probability against the actual fraction of positive instances within that bin.
The diagonal dashed line represents perfect calibration. Bars below the diagonal indicate over-prediction (predicted probability is higher than the actual frequency), while bars above indicate under-prediction. The red bars show an uncalibrated model, while the blue bars show improved calibration after applying a technique.
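The points of such a reliability diagram can be computed directly with scikit-learn's calibration_curve helper. The sketch below uses synthetic placeholder arrays for the labels and predicted probabilities; with a real model you would pass its predict_proba output instead.

import numpy as np
from sklearn.calibration import calibration_curve

# Placeholder data: true binary labels and (mis)calibrated predicted probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.7 * y_true + rng.normal(0.15, 0.2, size=1000), 0, 1)

# Bin the predictions; per bin, get the observed fraction of positives
# (prob_true) and the mean predicted probability (prob_pred)
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

# For a well-calibrated model, prob_true stays close to prob_pred in every bin
for p_pred, p_true in zip(prob_pred, prob_true):
    print(f"mean predicted: {p_pred:.2f}   observed fraction: {p_true:.2f}")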
Boosting algorithms, while powerful, can produce uncalibrated probabilities. The additive nature, focus on correcting errors (often leading to extreme scores for confidently classified points), and the specifics of tree construction can contribute to this. Therefore, checking and potentially correcting the calibration of XGBoost, LightGBM, or CatBoost classifiers is often a good practice.
Two widely used methods for probability calibration are Platt Scaling and Isotonic Regression. Both are typically applied after the main classifier has been trained, using a separate calibration dataset (e.g., a hold-out validation set) that was not used for training the original model. Using the training data for calibration would lead to overly optimistic results.
Platt Scaling assumes that the distortion between the model's output scores and the true probabilities can be corrected by fitting a sigmoid function. For a binary classifier outputting scores s, Platt Scaling finds parameters A and B such that the calibrated probability P(class=1∣s) is estimated by:
$$P_{\text{Platt}}(y=1 \mid s) = \frac{1}{1 + \exp(A s + B)}$$

The parameters A and B are typically found by minimizing the log loss (or cross-entropy) on the calibration set, relating the original model's scores s to the true labels y in that set.
Platt Scaling works best when the calibration curve is monotonically increasing and roughly sigmoid-shaped. It is computationally efficient and needs less calibration data than Isotonic Regression.
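As a rough sketch of the idea (not the exact routine scikit-learn uses internally), Platt Scaling can be implemented with a one-feature logistic regression fit on a hold-out calibration set: the classifier's scores are the input, the true labels are the target, and the fitted coefficient and intercept play the roles of A and B (up to sign). The scores_calib and y_calib names below are placeholders for your own arrays.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_calib, y_calib):
    """Fit a sigmoid p = 1 / (1 + exp(A*s + B)) on calibration-set scores."""
    # A one-feature logistic regression minimizes log loss, which matches the
    # Platt objective; a large C keeps sklearn's L2 penalty effectively off,
    # since the original formulation is unregularized.
    lr = LogisticRegression(C=1e6)
    lr.fit(np.asarray(scores_calib).reshape(-1, 1), y_calib)
    return lr

def platt_predict(lr, scores):
    """Map raw scores to calibrated probabilities of the positive class."""
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]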
Isotonic Regression is a non-parametric approach. It fits a non-decreasing, piecewise-constant function to the relationship between the model's predicted scores and the observed target values in the calibration set. It finds the best-fitting (in a least-squares sense) step function that preserves the order of the inputs.
The fit is typically computed with the Pool Adjacent Violators Algorithm (PAVA). Isotonic Regression makes fewer assumptions about the shape of the distortion than Platt Scaling, so if the calibration curve is not sigmoid-shaped it can often provide a better fit. However, it generally requires more data to yield stable results and can produce sharp steps in the calibrated probabilities, which is not always desirable.
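A corresponding sketch with scikit-learn's IsotonicRegression is shown below; out_of_bounds='clip' keeps test-time scores that fall outside the calibration range inside the fitted function's domain. Again, scores_calib and y_calib are placeholder names for a hold-out calibration set.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(scores_calib, y_calib):
    """Fit a non-decreasing step function from raw scores to probabilities."""
    # PAVA runs under the hood; y_min/y_max bound the output to [0, 1]
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
    iso.fit(np.asarray(scores_calib), y_calib)
    return iso

def isotonic_predict(iso, scores):
    """Map raw scores to calibrated probabilities via the fitted step function."""
    return iso.predict(np.asarray(scores))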
Scikit-learn provides a convenient wrapper, CalibratedClassifierCV, to handle both model training and calibration. It can perform calibration using either Platt Scaling (method='sigmoid') or Isotonic Regression (method='isotonic').
Internally, CalibratedClassifierCV uses cross-validation. For each fold, a clone of the base classifier is trained on that fold's training portion, and a calibrator (sigmoid or isotonic) is fitted on the classifier's predictions for the held-out portion. By default (ensemble=True), the resulting (classifier, calibrator) pairs are all kept and their calibrated probabilities are averaged at prediction time; with ensemble=False, the base classifier is instead refit on all of the data and a single calibrator, trained on the out-of-fold predictions, is applied to its output.
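As a small configuration sketch (assuming a reasonably recent scikit-learn), the ensemble flag switches between these two behaviors; base_model stands in for any gradient boosting classifier. The full worked example that follows trains an XGBoost model and compares both calibration methods on the same test set.

from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

base_model = XGBClassifier(eval_metric='logloss', random_state=42)

# Default (ensemble=True): keep one (classifier, calibrator) pair per fold
# and average their calibrated probabilities at prediction time
calibrated_ensemble = CalibratedClassifierCV(base_model, method='sigmoid', cv=5)

# ensemble=False: refit the base classifier once on all the data and apply a
# single calibrator trained on the out-of-fold predictions
calibrated_single = CalibratedClassifierCV(base_model, method='sigmoid',
                                           cv=5, ensemble=False)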
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train, test_size=0.3, random_state=42)  # Optional explicit calibration split (unused below, since CalibratedClassifierCV handles CV internally)
# 1. Train the base XGBoost model (example)
base_model = XGBClassifier(eval_metric='logloss', random_state=42)
base_model.fit(X_train, y_train)
y_pred_proba_uncalibrated = base_model.predict_proba(X_test)[:, 1]
# 2. Calibrate using CalibratedClassifierCV (handles CV internally)
# Platt Scaling
calibrated_sigmoid = CalibratedClassifierCV(
    estimator=base_model,  # named base_estimator in scikit-learn < 1.2
    method='sigmoid',
    cv=3                   # use 3-fold cross-validation
)
calibrated_sigmoid.fit(X_train, y_train)  # Fit on the training set; internal CV supplies the calibration data
y_pred_proba_sigmoid = calibrated_sigmoid.predict_proba(X_test)[:, 1]
# Isotonic Regression
calibrated_isotonic = CalibratedClassifierCV(
    estimator=base_model,
    method='isotonic',
    cv=3                   # use 3-fold cross-validation
)
calibrated_isotonic.fit(X_train, y_train)
y_pred_proba_isotonic = calibrated_isotonic.predict_proba(X_test)[:, 1]
# 3. Visualize calibration
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(8, 8))
disp1 = CalibrationDisplay.from_predictions(y_test, y_pred_proba_uncalibrated, n_bins=10, name='Uncalibrated XGBoost', ax=ax, marker='^', color='#fa5252')
disp2 = CalibrationDisplay.from_predictions(y_test, y_pred_proba_sigmoid, n_bins=10, name='Platt Scaling', ax=ax, marker='o', color='#4dabf7')
disp3 = CalibrationDisplay.from_predictions(y_test, y_pred_proba_isotonic, n_bins=10, name='Isotonic Regression', ax=ax, marker='s', color='#69db7c')
ax.set_title('Calibration Curves (Reliability Diagram)')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.legend()
plt.grid(True)
plt.show()
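Beyond the visual check, the improvement can be quantified with proper scoring rules such as the Brier score and log loss on the test set. This snippet continues from the arrays defined above; lower values are better for both metrics.

from sklearn.metrics import brier_score_loss, log_loss

for name, proba in [('Uncalibrated XGBoost', y_pred_proba_uncalibrated),
                    ('Platt Scaling', y_pred_proba_sigmoid),
                    ('Isotonic Regression', y_pred_proba_isotonic)]:
    print(f"{name:>20}  Brier: {brier_score_loss(y_test, proba):.4f}  "
          f"Log loss: {log_loss(y_test, proba):.4f}")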
CalibratedClassifierCV handles the need for a separate calibration set through its internal cross-validation. If you calibrate manually, make sure to use a dedicated hold-out set that the base model never saw during training.

In summary, probability calibration is an important post-processing technique for classification models, including gradient boosting machines. When reliable probability estimates are needed, methods like Platt Scaling or Isotonic Regression, often applied via tools like CalibratedClassifierCV, can significantly improve the trustworthiness and utility of your model's predictions.