Embedded feature selection methods integrate the selection process directly into the construction of the machine learning model. Following our look at L1 regularization (Lasso), another very common and effective family of embedded techniques relies on tree-based models. Algorithms like Decision Trees, Random Forests, and Gradient Boosting inherently compute a score for each feature's importance during training.
Tree-based models build a series of decision rules by recursively splitting the data based on feature values. The core idea behind feature importance in these models is that features which are more effective at splitting the data and improving the purity of the nodes (for classification) or reducing variance (for regression) are considered more important.
Consider a single decision tree. When the tree algorithm decides where to split a node, it evaluates potential splits across various features and thresholds. The feature and threshold that result in the best split (e.g., the largest reduction in Gini impurity for classification, or the biggest decrease in mean squared error for regression) are chosen. Features that are selected for splits higher up in the tree (closer to the root) or are used in more splits generally have a greater impact on the final prediction.
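To make the splitting criterion concrete, here is a minimal sketch (not scikit-learn's internal code) that scores one candidate split by its weighted decrease in Gini impurity; the toy label arrays are made up purely for illustration:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A toy node with 10 samples and one candidate split into two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 0, 1])   # 5 samples go left
right  = np.array([1, 1, 1, 1, 1])   # 5 samples go right

n, n_left, n_right = len(parent), len(left), len(right)
# Weighted impurity decrease: the quantity the tree maximizes when choosing a split
decrease = gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)
print(f"Gini decrease for this split: {decrease:.3f}")  # 0.320 for these toy labels

Summing these weighted decreases over every split a feature is used for (and normalizing) is what yields the impurity-based importance scores discussed below.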
For ensemble models like Random Forests or Gradient Boosting Machines, which combine predictions from multiple trees, the feature importance is typically calculated as the average importance of that feature across all trees in the ensemble. This averaging process makes the importance scores more robust and reliable compared to those from a single decision tree.
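As a rough check of this averaging, the sketch below mirrors (but does not reproduce exactly) what scikit-learn does internally: each fitted tree exposes its own normalized importances, and the forest reports essentially their mean, renormalized to sum to 1. The synthetic dataset is only for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Average the per-tree importances, then renormalize to sum to 1
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()

print(np.allclose(averaged, forest.feature_importances_))  # expected: True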
Scikit-learn makes it straightforward to access feature importances after training a tree-based model. Most ensemble tree estimators (like RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor, HistGradientBoostingClassifier, etc.) provide a feature_importances_ attribute once the fit method has been called.
Let's see a practical example. Assume we have our feature matrix X_train and target vector y_train.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Assume X and y are pre-loaded pandas DataFrame and Series
# For demonstration, let's create synthetic data:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=2, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train a RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
# Access feature importances
importances = rf_model.feature_importances_
# Create a DataFrame for better visualization
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print("Feature Importances from RandomForestClassifier:")
print(importance_df)
# You can visualize these importances using a bar chart
This code trains a Random Forest model and then extracts the importance scores, printing them in descending order. Visualizing these scores often provides clearer insights.
Feature importances calculated by a Random Forest model, visualized as a horizontal bar chart. Features are ranked from most to least important based on their contribution to the model's predictions (mean decrease in impurity).
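If you want to produce a similar chart yourself, a minimal matplotlib sketch (reusing the importance_df built in the snippet above) could look like this:

import matplotlib.pyplot as plt

# Horizontal bar chart of the importances computed above, most important at the top
plot_df = importance_df.sort_values(by='Importance')  # ascending, so the largest bar ends up on top
plt.figure(figsize=(8, 5))
plt.barh(plot_df['Feature'], plot_df['Importance'])
plt.xlabel('Importance (mean decrease in impurity)')
plt.title('Random Forest Feature Importances')
plt.tight_layout()
plt.show()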
Once you have the importance scores, you can use them to select a subset of features. Common strategies include keeping the N features with the highest importance scores, or keeping every feature whose importance exceeds a chosen threshold. Scikit-learn provides the SelectFromModel meta-transformer, which simplifies this process. It takes an estimator that has a feature_importances_ (or coef_ for linear models) attribute and selects features based on a specified threshold.
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Using SelectFromModel with the trained RandomForest model
# We can set a threshold, e.g., select features with importance > median importance
# Calculate median importance from the fitted model
median_importance = np.median(rf_model.feature_importances_)
selector = SelectFromModel(estimator=rf_model, threshold=median_importance, prefit=True)
# Alternatively, pass threshold='median' and let SelectFromModel compute it:
# selector = SelectFromModel(estimator=rf_model, threshold='median', prefit=True)
# prefit=True tells SelectFromModel that rf_model is already trained.
# With prefit=False, SelectFromModel would fit the estimator itself.
# Transform the data to keep only selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
# Get the names of the selected features
selected_feature_indices = selector.get_support(indices=True)
selected_feature_names = X_train.columns[selected_feature_indices]
print(f"\nOriginal number of features: {X_train.shape[1]}")
print(f"Number of features selected (importance > median): {X_train_selected.shape[1]}")
print(f"Selected features: {selected_feature_names.tolist()}")
SelectFromModel can use thresholds like "mean", "median", or a float value (e.g., 0.01). You can also specify max_features to cap the number of selected features; to select purely the top N features by importance, pass max_features=N together with threshold=-np.inf so that the threshold does not filter anything out, as shown below.
While tree-based feature importance is widely used and effective, keep a few points in mind: impurity-based importances are computed from the training data, so they can overstate the usefulness of features that mainly help the model overfit; they tend to be biased toward features with many unique values (high cardinality); and when features are strongly correlated, the importance can be split among them, making each look less relevant than it really is.
Tree-based feature importance offers a computationally efficient way to estimate feature relevance directly from the models many data scientists use daily. By understanding how these scores are derived and their potential limitations, you can effectively use them as part of your feature selection toolkit, often in combination with SelectFromModel, to reduce dimensionality and potentially improve model performance and interpretability.