In the preceding sections, we explored the rationale behind feature selection and surveyed the three main categories of techniques: filter, wrapper, and embedded methods. Now, let's put this knowledge into practice. We'll use Python's Scikit-learn library to apply these methods to a synthetic dataset, demonstrating how to reduce dimensionality effectively.
First, let's set up our environment and generate some data. We'll create a classification dataset with several informative features, a few redundant ones, and some noise features using `make_classification`. This setup mimics real-world scenarios where not all collected data contributes positively to model performance.
```python
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns # Using seaborn for easier plotting styling
# Generate a synthetic dataset
# 20 features total: 8 informative, 4 redundant, 8 noise
X, y = make_classification(n_samples=500, n_features=20,
n_informative=8, n_redundant=4,
n_repeated=0, n_classes=2,
n_clusters_per_class=2,
flip_y=0.05, # Add some noise to labels
class_sep=0.7,
random_state=42)
# Convert to DataFrame for easier handling
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# It's often good practice to scale data, especially for methods like RFE with Logistic Regression or Lasso
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert scaled arrays back to DataFrames for clarity (optional, but helps keep track of feature names)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_names)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_names)
print("Original training data shape:", X_train.shape)
Our initial dataset has 20 features. Our goal is to use feature selection techniques to identify and keep only the most relevant ones.
### Filter Methods

Filter methods evaluate features based on their intrinsic statistical properties, independently of any specific model.

The simplest filter is `VarianceThreshold`, which removes features whose variance falls below a given threshold. It is useful for eliminating constant or quasi-constant features. Let's remove features with zero variance (our synthetic data likely won't have any, but it's good practice).
```python
# Initialize VarianceThreshold (default threshold=0.0 removes constant features)
selector_vt = VarianceThreshold()
# Fit on training data
selector_vt.fit(X_train_scaled)
# Get the features to keep
features_to_keep = X_train_scaled.columns[selector_vt.get_support()]
print(f"Features kept after VarianceThreshold: {len(features_to_keep)}/{X_train_scaled.shape[1]}")
# print("Kept features:", features_to_keep.tolist()) # Uncomment to see names
# Transform the data (usually you'd transform both train and test)
X_train_vt = selector_vt.transform(X_train_scaled)
X_test_vt = selector_vt.transform(X_test_scaled)
print("Shape after VarianceThreshold:", X_train_vt.shape)
In this case, `VarianceThreshold` likely kept all features, since `make_classification` doesn't produce constant features unless told to. You can adjust the `threshold` parameter to remove low-variance features, but note that because we standardized the data, every feature now has unit variance; a variance threshold is more informative when applied to the unscaled features.
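As an illustration, here is a minimal sketch that applies a non-zero threshold to the unscaled training data; the 0.01 cutoff is an arbitrary value chosen for demonstration.

```python
# Illustrative only: apply a non-zero variance threshold to the *unscaled* features,
# since the standardized features all have unit variance. The 0.01 cutoff is arbitrary.
selector_vt_low = VarianceThreshold(threshold=0.01)
selector_vt_low.fit(X_train)
kept_low_var = X_train.columns[selector_vt_low.get_support()]
print(f"Features kept with threshold=0.01: {len(kept_low_var)}/{X_train.shape[1]}")
```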
Univariate statistical tests score each feature's relationship with the target variable individually. We'll use `SelectKBest` with the ANOVA F-value test (`f_classif`), which suits numerical features and a categorical target.
```python
# Select the top 10 features based on ANOVA F-value
k = 10
selector_kbest = SelectKBest(score_func=f_classif, k=k)
# Fit on the scaled training data and target
selector_kbest.fit(X_train_scaled, y_train)
# Get the selected feature names
kbest_features = X_train_scaled.columns[selector_kbest.get_support()]
print(f"Selected top {k} features using SelectKBest (f_classif):")
print(kbest_features.tolist())
# Transform the data
X_train_kbest = selector_kbest.transform(X_train_scaled)
X_test_kbest = selector_kbest.transform(X_test_scaled)
print("Shape after SelectKBest:", X_train_kbest.shape)
# You can also inspect the scores
feature_scores = pd.DataFrame({'Feature': X_train_scaled.columns, 'Score': selector_kbest.scores_})
print("\nTop 5 features by ANOVA F-score:")
print(feature_scores.sort_values(by='Score', ascending=False).head())
```
`SelectKBest` provides a quick way to rank features by their individual predictive power according to the chosen statistical test. Keep in mind that `f_classif` captures linear relationships, so non-linear dependencies may be missed; `mutual_info_classif` can detect those. For non-negative count or categorical features, `chi2` is the usual choice.
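A minimal sketch swapping in `mutual_info_classif` as the scoring function, mirroring the `SelectKBest` call above with the same k:

```python
from sklearn.feature_selection import mutual_info_classif

# Mutual information can capture non-linear dependencies that the ANOVA F-test misses.
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
selector_mi.fit(X_train_scaled, y_train)
mi_features = X_train_scaled.columns[selector_mi.get_support()]
print("Top 10 features by mutual information:")
print(mi_features.tolist())
```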
### Wrapper Methods

Wrapper methods use a specific machine learning model to evaluate subsets of features.

Recursive Feature Elimination (RFE) repeatedly fits a model, ranks features (by coefficient magnitude or feature importance), and removes the weakest one(s) until the desired number remains. We'll use `LogisticRegression` as the estimator.
```python
# Initialize the estimator
estimator = LogisticRegression(solver='liblinear', random_state=42)
# Initialize RFE to select 8 features
# Note: RFE works best with estimators that provide coefficient weights or feature importances
selector_rfe = RFE(estimator=estimator, n_features_to_select=8, step=1) # step=1 removes 1 feature per iteration
# Fit RFE on the scaled training data
selector_rfe.fit(X_train_scaled, y_train)
# Get the selected feature names
rfe_features = X_train_scaled.columns[selector_rfe.support_]
print(f"Selected {selector_rfe.n_features_} features using RFE (LogisticRegression):")
print(rfe_features.tolist())
# Transform the data
X_train_rfe = selector_rfe.transform(X_train_scaled)
X_test_rfe = selector_rfe.transform(X_test_scaled)
print("Shape after RFE:", X_train_rfe.shape)
# RFE with Cross-Validation (RFECV) can find the optimal number of features
# estimator_cv = LogisticRegression(solver='liblinear', random_state=42)
# selector_rfecv = RFECV(estimator=estimator_cv, step=1, cv=5, scoring='accuracy') # Use appropriate scoring
# selector_rfecv.fit(X_train_scaled, y_train)
# print(f"\nOptimal number of features found by RFECV: {selector_rfecv.n_features_}")
# rfecv_features = X_train_scaled.columns[selector_rfecv.support_]
# print("Selected features by RFECV:", rfecv_features.tolist())
RFE is more computationally intensive than filter methods because it retrains the estimator many times. However, it implicitly accounts for feature interactions through the model's evaluation. `RFECV` automates finding the optimal feature count based on cross-validated performance.
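If you uncomment and fit the `RFECV` example above, plotting the cross-validated scores shows how performance varies with the number of features. This is a minimal sketch assuming scikit-learn 1.0 or later, where `RFECV` exposes a `cv_results_` dictionary, and assuming `selector_rfecv` was fitted with the default `min_features_to_select=1` and `step=1`.

```python
# Sketch only: requires the fitted selector_rfecv from the commented-out block above
# and scikit-learn >= 1.0 (which provides the cv_results_ attribute).
mean_scores = selector_rfecv.cv_results_['mean_test_score']
plt.figure(figsize=(8, 4))
plt.plot(range(1, len(mean_scores) + 1), mean_scores, marker='o')  # one score per candidate feature count
plt.xlabel('Number of features selected')
plt.ylabel('Mean cross-validated accuracy')
plt.title('RFECV: accuracy vs. number of features')
plt.tight_layout()
plt.show()
```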
### Embedded Methods

Embedded methods perform feature selection as part of the model training process.

Linear models with L1 regularization, like Lasso, tend to shrink the coefficients of less important features to exactly zero, effectively performing feature selection. We'll use `LogisticRegression` with an L1 penalty.
```python
# Using Logistic Regression with L1 penalty
# The 'C' parameter is the inverse of regularization strength; smaller C means stronger regularization
# We'll use a fixed C, but often LassoCV or GridSearchCV is used to find the optimal C
l1_estimator = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', random_state=42)
l1_estimator.fit(X_train_scaled, y_train)
# Coefficients that are non-zero correspond to selected features
l1_coeffs = l1_estimator.coef_[0]
l1_selected_features = X_train_scaled.columns[l1_coeffs != 0]
print(f"Selected features using Logistic Regression (L1 penalty, C=0.1): {len(l1_selected_features)}")
print(l1_selected_features.tolist())
# Create a DataFrame for coefficients
coef_df = pd.DataFrame({'Feature': X_train_scaled.columns, 'Coefficient': l1_coeffs})
print("\nL1 Coefficients:")
# Display only non-zero coefficients for brevity
print(coef_df[coef_df['Coefficient'] != 0].sort_values(by='Coefficient', key=abs, ascending=False))
```
Lasso-style selection is efficient and directly integrates selection with model fitting. The regularization strength (controlled by `C` in `LogisticRegression`, or `alpha` in `LassoCV` for regression targets) determines how many features are kept.
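Rather than fixing `C` by hand, cross-validation can choose it. A minimal sketch using `LogisticRegressionCV` (not imported above, so imported here) follows; the grid of `Cs` values is an illustrative choice.

```python
from sklearn.linear_model import LogisticRegressionCV

# Search a small, illustrative grid of C values with 5-fold CV;
# penalty='l1' with the liblinear solver keeps the sparsity-inducing behaviour.
l1_cv = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0], penalty='l1',
                             solver='liblinear', cv=5, random_state=42)
l1_cv.fit(X_train_scaled, y_train)
print("Best C found by cross-validation:", l1_cv.C_[0])
print("Number of non-zero coefficients:", int((l1_cv.coef_[0] != 0).sum()))
```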
Tree-based ensemble methods like Random Forests naturally compute feature importances during training, based on how much each feature contributes to reducing impurity (e.g., Gini impurity or entropy) across all trees.
```python
# Initialize RandomForestClassifier
rf_estimator = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
# Fit the model
rf_estimator.fit(X_train_scaled, y_train)
# Get feature importances
importances = rf_estimator.feature_importances_
importance_df = pd.DataFrame({'Feature': X_train_scaled.columns, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print("\nFeature Importances from RandomForest:")
print(importance_df)
# Select features based on importance (e.g., keep top 10 or above a certain threshold)
threshold = 0.02 # Example threshold - adjust based on distribution
rf_selected_features = importance_df[importance_df['Importance'] > threshold]['Feature']
print(f"\nSelected features with importance > {threshold}: {len(rf_selected_features)}")
print(rf_selected_features.tolist())
# Visualize Feature Importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(15), palette='viridis') # Display top 15
plt.title('Top 15 Feature Importances from RandomForestClassifier')
plt.tight_layout()
plt.show() # In a web environment, you might render this using a library like Plotly
# Generate Plotly JSON for web rendering (example for the top 10 features)
import json

top_10_importance = importance_df.head(10).sort_values(by='Importance', ascending=True)  # Ascending for a horizontal bar chart
plotly_fig = {
    "data": [
        {
            "type": "bar",
            "y": top_10_importance['Feature'].tolist(),
            "x": top_10_importance['Importance'].tolist(),
            "orientation": "h",
            "marker": {"color": "#20c997"}
        }
    ],
    "layout": {
        "title": "Top 10 Feature Importances (Random Forest)",
        "yaxis": {"title": "Feature"},
        "xaxis": {"title": "Importance Score"},
        "height": 400,
        "margin": {"l": 120, "r": 20, "t": 50, "b": 50}
    }
}
# json.dumps produces valid JSON (double-quoted strings), unlike interpolating Python list reprs into a string
print("```plotly")
print(json.dumps(plotly_fig, indent=2))
print("```")
```
> Feature importances calculated by a Random Forest model, indicating the relative contribution of each feature to the model's predictions. Higher scores suggest greater importance.
Tree-based importances are powerful as they can capture non-linear relationships and feature interactions. However, correlated features might split importance, potentially underestimating their collective value.
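If you prefer not to threshold the importance table by hand, scikit-learn's `SelectFromModel` wraps the same idea. Here is a minimal sketch that refits a forest internally; the `'median'` threshold is an illustrative choice, not a recommendation.

```python
from sklearn.feature_selection import SelectFromModel

# SelectFromModel fits the estimator and keeps features whose importance exceeds the threshold.
# threshold='median' (an illustrative choice) keeps roughly the top half of features.
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
                      threshold='median')
sfm.fit(X_train_scaled, y_train)
sfm_features = X_train_scaled.columns[sfm.get_support()]
print(f"Features selected by SelectFromModel: {len(sfm_features)}")
print(sfm_features.tolist())
```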
### Integrating Selection into Pipelines
A crucial aspect of feature selection is applying it correctly within a machine learning workflow, especially when using cross-validation. Feature selection should ideally be performed *inside* each cross-validation fold, using only the training data for that fold to avoid data leakage from the validation set into the selection process. Scikit-learn's `Pipeline` object is perfect for this.
```python
# Example: Pipeline combining RFE and Logistic Regression
pipeline = Pipeline([
('scaler', StandardScaler()), # Step 1: Scale data
('selector', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=8)), # Step 2: Select features
('classifier', LogisticRegression(solver='liblinear')) # Step 3: Train final model
])
# Now, you can fit the entire pipeline
pipeline.fit(X_train, y_train) # Fit on original training data
# Evaluate on the test set
accuracy = pipeline.score(X_test, y_test)
print(f"\nPipeline Accuracy (Scaler -> RFE -> LogisticRegression): {accuracy:.4f}")
# You could also use GridSearchCV with a pipeline to tune hyperparameters
# including the number of features to select in RFE or the C parameter in Lasso.
```
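As a minimal sketch of the `GridSearchCV` idea mentioned in the comments above, the grid below tunes how many features the RFE step keeps; the candidate values are arbitrary illustrative choices.

```python
from sklearn.model_selection import GridSearchCV

# Tune the number of features kept by the 'selector' step of the pipeline.
# The candidate values are illustrative, not recommendations.
param_grid = {'selector__n_features_to_select': [4, 8, 12, 16]}
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print("Best number of features:", grid.best_params_['selector__n_features_to_select'])
print(f"Best cross-validated accuracy: {grid.best_score_:.4f}")
```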
This hands-on practical demonstrated applying filter, wrapper, and embedded feature selection methods using Scikit-learn. You saw how to remove low-variance features, select features based on statistical tests, use model-based recursive elimination, and leverage regularization or tree importances. Remember that the best method depends on the dataset characteristics, the chosen model, and computational constraints. Incorporating selection into a `Pipeline` ensures it is applied correctly during model development and evaluation.