Wrapper methods evaluate feature subsets by training and testing a specific machine learning model. This approach differs from filter methods, which assess features independently of any model. Because wrapper methods use the model's performance as the evaluation criterion, they can often find feature subsets that are better tuned for a specific algorithm, potentially capturing interactions between features that filter methods might miss. However, this comes at a higher computational cost, as multiple models need to be trained.
Recursive Feature Elimination (RFE) is a popular and intuitive wrapper method. It works by recursively removing the least important features based on an external estimator that assigns weights or importance scores to features (like the coefficients of a linear model or the feature importances of a tree-based model).
The core idea behind RFE is straightforward:

1. Train the chosen estimator on the current set of features.
2. Rank the features by importance, using the estimator's coefficients (coef_) or its feature_importances_ attribute.
3. Remove the least important feature (or a small batch of features).
4. Repeat steps 1-3 until the desired number of features remains.

This iterative process progressively prunes the feature set, aiming to retain the features that contribute most significantly to the model's predictive power according to the chosen estimator.
Diagram: The iterative process of Recursive Feature Elimination (RFE).
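To make the loop concrete, here is a simplified sketch of the elimination process. It uses a synthetic dataset and a logistic regression whose absolute coefficients serve as the importance scores; it illustrates the idea only and is not a replacement for scikit-learn's RFE implementation shown later.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# Synthetic data for the illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=42)
remaining = list(range(X_demo.shape[1]))  # indices of the features still in play
n_features_to_keep = 5
while len(remaining) > n_features_to_keep:
    # 1. Train the estimator on the current feature subset
    model = LogisticRegression(solver='liblinear').fit(X_demo[:, remaining], y_demo)
    # 2. Rank features by the magnitude of their coefficients
    importances = np.abs(model.coef_[0])
    # 3. Drop the least important feature and repeat
    weakest = remaining[int(np.argmin(importances))]
    remaining.remove(weakest)
print("Surviving feature indices:", remaining)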
The choice of the estimator within RFE is significant because the feature importance ranking depends entirely on it. Linear models such as LinearRegression, LogisticRegression, or LinearSVC use coefficient magnitudes (coef_) to rank features: features with larger absolute coefficients are considered more important, and regularization (like L1 or L2) within these models can influence the coefficients and thus the ranking. Tree-based models such as DecisionTreeClassifier, RandomForestClassifier, or GradientBoostingClassifier use the feature_importances_ attribute, which typically measures how much each feature contributes to reducing impurity (like Gini impurity or entropy) across all trees in the forest or ensemble.

It's generally a good idea to use an estimator in RFE that is similar to the final model you intend to deploy, although simpler, faster models (like linear models) are often used for the selection process itself to manage computational cost.
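For example, the two kinds of estimators expose the scores RFE consumes through different attributes. The following sketch (using a small synthetic dataset chosen purely for illustration) prints both:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
# Linear model: importance comes from the absolute value of coef_
linear_model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)
print("Absolute coefficients:", np.abs(linear_model.coef_[0]).round(3))
# Tree ensemble: importance comes from feature_importances_
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
print("Impurity-based importances:", forest.feature_importances_.round(3))
The two rankings will generally not agree, which is why the estimator you pick for RFE shapes the subset it returns.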
Scikit-learn provides a convenient implementation of RFE via the sklearn.feature_selection.RFE class.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Generate synthetic data
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=2, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])
# 1. Choose an estimator
estimator = LogisticRegression(solver='liblinear')
# 2. Initialize RFE
# Select the top 5 features
rfe_selector = RFE(estimator=estimator, n_features_to_select=5, step=1) # step=1 removes 1 feature per iteration
# 3. Fit RFE to the data
rfe_selector = rfe_selector.fit(X, y)
# 4. Get the results
selected_features_mask = rfe_selector.support_
feature_ranking = rfe_selector.ranking_
print("Feature Names:", X.columns.tolist())
print("Selected Mask:", selected_features_mask) # True for selected features
print("Feature Ranking:", feature_ranking) # 1 indicates selected, higher numbers were eliminated earlier
# Get the names of the selected features
selected_feature_names = X.columns[selected_features_mask]
print("\nSelected Features:", selected_feature_names.tolist())
# Transform the data to keep only selected features
X_selected = rfe_selector.transform(X)
print("\nShape of original data:", X.shape)
print("Shape of data after RFE:", X_selected.shape)
In this example:

- We use LogisticRegression as the base estimator.
- We set n_features_to_select=5, telling RFE to stop when only the 5 best features remain.
- The support_ attribute returns a boolean mask indicating which features were selected.
- The ranking_ attribute assigns a rank to each feature: selected features get rank 1, and higher ranks correspond to features eliminated earlier.
- The transform method can be used to directly obtain the dataset with only the selected features.

A common challenge with RFE is knowing how many features (n_features_to_select) to keep. Selecting too few might discard useful information, while selecting too many might retain noise or redundant features.
Scikit-learn offers RFECV
(RFE with Cross-Validation), which automatically determines the optimal number of features. It performs RFE within a cross-validation loop, evaluating model performance for different numbers of selected features.
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
# Use RFECV to find the optimal number of features
cv_strategy = StratifiedKFold(n_splits=5) # Use stratified k-fold for classification
rfecv_selector = RFECV(estimator=estimator,
                       step=1,
                       cv=cv_strategy,
                       scoring='accuracy',        # Performance metric to optimize
                       min_features_to_select=1)  # Minimum features to consider
rfecv_selector = rfecv_selector.fit(X, y)
print(f"\nOptimal number of features found by RFECV: {rfecv_selector.n_features_}")
print("Selected Features by RFECV:", X.columns[rfecv_selector.support_].tolist())
# You can optionally plot performance vs number of features
# The grid_scores_ attribute (or cv_results_['mean_test_score'] in newer versions)
# contains the scores for each number of features tested.
# (Plotting code omitted for brevity)
RFECV uses the specified cross-validation strategy (cv) and performance metric (scoring) to find the number of features that yields the best average score across the folds. This is generally a more data-driven way to choose the feature subset size than manually setting n_features_to_select.
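If you want to visualize how the cross-validated score changes with the number of selected features (the plot omitted above), the following sketch continues from the rfecv_selector fitted earlier; it assumes matplotlib is installed and a scikit-learn version recent enough to expose cv_results_.
import matplotlib.pyplot as plt
# One mean score per candidate subset size; with step=1 and
# min_features_to_select=1, these correspond to 1, 2, ..., n_features features.
mean_scores = rfecv_selector.cv_results_['mean_test_score']
n_features_range = range(1, len(mean_scores) + 1)
plt.plot(n_features_range, mean_scores, marker='o')
plt.xlabel('Number of features selected')
plt.ylabel('Mean cross-validated accuracy')
plt.title('RFECV performance vs. number of features')
plt.show()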
Advantages:

- Features are evaluated in the context of a specific model, so RFE can capture interactions between features that filter methods miss.
- The resulting subset is tuned to the chosen estimator, which often translates into better predictive performance for that model.

Disadvantages:

- The ranking depends entirely on the chosen estimator; a different estimator may select a different subset.
- It is computationally expensive, since a model must be retrained at every elimination step (and across every cross-validation fold when using RFECV). This can be slow for large datasets or complex models.

RFE is a valuable tool when you suspect feature interactions are relevant and when the computational cost is manageable. Using RFECV provides a more automated and often more reliable way to apply RFE by integrating cross-validation to determine the number of features to retain.