Once a model is trained, it's natural to ask what it has learned. Which features are most influential in its predictions? A gradient boosting model, despite being an ensemble of many trees, is not a complete black box. We can inspect it to understand which features are the most significant drivers of its decisions. This process helps validate the model, communicate its findings, and even guide future feature engineering efforts.
In tree-based ensembles like gradient boosting, feature importance is typically calculated from how much each feature contributes to reducing the model's loss or impurity. The Scikit-Learn implementation uses a method called mean decrease in impurity (MDI), also known as Gini importance when Gini impurity is the splitting criterion.
Here's how it works at a high level:

1. Every time a feature is chosen to split a node in a tree, that split reduces the node's impurity (variance for regression, Gini impurity or entropy for classification).
2. The impurity reduction from each split is credited to the splitting feature, weighted by the fraction of training samples reaching the node.
3. These credits are summed per feature across every tree in the ensemble, and the totals are normalized so the scores sum to 1.
A higher score indicates that the feature was more frequently used to make effective splits, and thus, the model relies on it more heavily.
After fitting a GradientBoostingClassifier or GradientBoostingRegressor, you can access the feature importance scores through the feature_importances_ attribute. This attribute is a NumPy array where each element is the importance of the corresponding column of the training data, with the scores normalized to sum to 1.
Let's look at a practical example. We will train a GradientBoostingRegressor on a synthetic dataset and then inspect the feature importances.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
# Generate a synthetic dataset
# With n_informative=3, only three of the ten features influence the target
X, y = make_regression(n_samples=1000, n_features=10, n_informative=3, random_state=42)
# Initialize and train the model
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X, y)
# Access the feature importances
importances = gbr.feature_importances_
# Create a list of feature names for clarity
feature_names = [f'Feature {i}' for i in range(X.shape[1])]
# Print the feature importances
for name, score in zip(feature_names, importances):
    print(f"{name}: {score:.4f}")
Running this code will produce an output showing the normalized importance score for each of the 10 features. You'll likely see that the informative features specified during data generation have significantly higher scores than the others.
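To make the strongest signals stand out, you can sort the scores before printing them. The following snippet is a small extension of the example above, using NumPy's argsort to rank the features from most to least important (it assumes the importances and feature_names variables from the previous listing):

# Rank features from most to least important
ranked = np.argsort(importances)[::-1]
for idx in ranked:
    print(f"{feature_names[idx]}: {importances[idx]:.4f}")
# The scores are normalized, so they sum to 1
print(f"Total: {importances.sum():.4f}")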
A long list of numbers can be difficult to interpret. A horizontal bar chart is an effective way to visualize and compare the relative importance of each feature. This makes it immediately clear which features the model found most predictive.
We can use the importances and feature_names from the previous example to create such a plot.
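A minimal sketch using matplotlib (assuming it is installed, and reusing the variables from the earlier listing) might look like this:

import matplotlib.pyplot as plt

# Sort features so the largest bar appears at the top of the chart
order = np.argsort(importances)
plt.figure(figsize=(8, 5))
plt.barh([feature_names[i] for i in order], importances[order])
plt.xlabel('Importance score')
plt.title('Gradient Boosting Feature Importances')
plt.tight_layout()
plt.show()

The resulting chart shows a typical output.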
Features are ranked by their contribution to the model's predictive power. In this example, Feature 2 is clearly the most influential.
While feature importance is a valuable tool, it's important to be aware of its limitations. Impurity-based importances are computed from training data statistics, so they reflect what the model relied on during fitting rather than what generalizes to unseen data, and they tend to inflate the scores of high-cardinality numerical features. When these biases are a concern, permutation importance computed on a held-out set is a common alternative.
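As an illustration of that alternative, the sketch below applies Scikit-Learn's permutation_importance to the model trained earlier. For brevity it reuses the training data, though in practice you would pass a held-out validation set:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the model's score drops
result = permutation_importance(gbr, X, y, n_repeats=10, random_state=42)
for name, score in zip(feature_names, result.importances_mean):
    print(f"{name}: {score:.4f}")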