Once a model is trained, it's natural to ask what it has learned. Which features are most influential in its predictions? A gradient boosting model, despite being an ensemble of many trees, is not a complete black box. We can inspect it to understand which features are the most significant drivers of its decisions. This process helps validate the model, communicate its findings, and even guide future feature engineering efforts.
In tree-based ensembles like gradient boosting, feature importance is typically calculated from how much each feature contributes to reducing the model's loss or impurity. The Scikit-Learn implementation uses a method called mean decrease in impurity (MDI), also known as Gini importance when Gini impurity is the splitting criterion.
Here's how it works at a high level:

1. Every time a feature is chosen to split a node in a tree, that split reduces the node's impurity (variance for regression, Gini impurity or entropy for classification).
2. The impurity reduction from each split is credited to the splitting feature, weighted by the fraction of training samples reaching the node.
3. These credits are summed per feature across every tree in the ensemble, and the totals are normalized so the scores sum to 1.
A higher score indicates that the feature was more frequently used to make effective splits, and thus, the model relies on it more heavily.
After fitting a GradientBoostingClassifier or GradientBoostingRegressor, you can access the feature importance scores through the feature_importances_ attribute. This attribute is a NumPy array where each element is the importance of the corresponding column of the training data, with the scores normalized to sum to 1.
Let's look at a practical example. We will train a GradientBoostingRegressor on a synthetic dataset and then inspect the feature importances.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
# Generate a synthetic dataset
# With n_informative=3, only three of the ten features influence the target
X, y = make_regression(n_samples=1000, n_features=10, n_informative=3, random_state=42)
# Initialize and train the model
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X, y)
# Access the feature importances
importances = gbr.feature_importances_
# Create a list of feature names for clarity
feature_names = [f'Feature {i}' for i in range(X.shape[1])]
# Print the feature importances
for name, score in zip(feature_names, importances):
    print(f"{name}: {score:.4f}")
Running this code will produce an output showing the normalized importance score for each of the 10 features. You'll likely see that the informative features specified during data generation have significantly higher scores than the others.
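To make the strongest signals stand out, you can sort the scores before printing them. The following snippet is a small extension of the example above, using NumPy's argsort to rank the features from most to least important (it assumes the importances and feature_names variables from the previous listing):

# Rank features from most to least important
ranked = np.argsort(importances)[::-1]
for idx in ranked:
    print(f"{feature_names[idx]}: {importances[idx]:.4f}")
# The scores are normalized, so they sum to 1
print(f"Total: {importances.sum():.4f}")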
A long list of numbers can be difficult to interpret. A horizontal bar chart is an effective way to visualize and compare the relative importance of each feature. This makes it immediately clear which features the model found most predictive.
We can use the importances and feature_names from the previous example to create such a plot.
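A minimal sketch using matplotlib (assuming it is installed, and reusing the variables from the earlier listing) might look like this:

import matplotlib.pyplot as plt

# Sort features so the largest bar appears at the top of the chart
order = np.argsort(importances)
plt.figure(figsize=(8, 5))
plt.barh([feature_names[i] for i in order], importances[order])
plt.xlabel('Importance score')
plt.title('Gradient Boosting Feature Importances')
plt.tight_layout()
plt.show()

The resulting chart shows a typical output.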
Features are ranked by their contribution to the model's predictive power. In this example, Feature 2 is clearly the most influential.
While feature importance is a valuable tool, it's important to be aware of its limitations. Impurity-based importances are computed from training data statistics, so they reflect what the model relied on during fitting rather than what generalizes to unseen data, and they tend to inflate the scores of high-cardinality numerical features. When these biases are a concern, permutation importance computed on a held-out set is a common alternative.
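As an illustration of that alternative, the sketch below applies Scikit-Learn's permutation_importance to the model trained earlier. For brevity it reuses the training data, though in practice you would pass a held-out validation set:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the model's score drops
result = permutation_importance(gbr, X, y, n_repeats=10, random_state=42)
for name, score in zip(feature_names, result.importances_mean):
    print(f"{name}: {score:.4f}")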