While VarianceThreshold helps remove features that don't change much, it doesn't consider the relationship between a feature and the target variable. To select features based on their predictive power for the target, we can turn to univariate statistical tests. These tests evaluate each feature individually (univariately) against the target variable, assigning a score based on the strength of their statistical relationship. Features with the highest scores are deemed more relevant.
Scikit-learn provides convenient tools like SelectKBest (which selects a fixed number k of top features) and SelectPercentile (which selects the top percentage of features) that work in conjunction with various scoring functions based on these statistical tests. The choice of the specific test depends on the data types of the feature and the target variable.
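As a quick sketch of how the two selectors differ (assuming you already have a feature matrix X and target y; only the way the cutoff is specified changes):
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif
# Keep exactly the 5 highest-scoring features
select_k = SelectKBest(score_func=f_classif, k=5)
# Keep the top 20% of features, however many that turns out to be
select_pct = SelectPercentile(score_func=f_classif, percentile=20)
# Both follow the usual scikit-learn pattern:
# X_reduced = select_k.fit_transform(X, y)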
Let's examine two common univariate tests used for feature selection in classification tasks.
When to Use: Use the Analysis of Variance (ANOVA) F-value when you have numerical input features and a categorical target variable.
Concept: ANOVA is a statistical test that checks if the means of a numerical variable are significantly different across two or more groups. In the context of feature selection for classification, each "group" corresponds to a class in your target variable. The F-value, or F-statistic, quantifies the ratio of variance between the groups (classes) to the variance within the groups.
Essentially, we are asking: "Does the value of this numerical feature tend to be different depending on the target class?" If the answer is yes (high F-value, low p-value), the feature is likely informative.
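To make this concrete, here is a small illustrative sketch (the data is synthetic and exists only for demonstration) showing that, for a single numerical feature, f_classif computes the same F-statistic as a one-way ANOVA run directly with scipy.stats.f_oneway on the per-class groups:
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif
# Toy data: one numerical feature whose mean differs between two classes
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
target = np.array([0] * 50 + [1] * 50)
# f_classif expects a 2D feature matrix and returns per-feature F-values and p-values
F_sklearn, p_sklearn = f_classif(feature.reshape(-1, 1), target)
# The same test, run directly on the two class groups
F_scipy, p_scipy = f_oneway(feature[target == 0], feature[target == 1])
print(F_sklearn[0], F_scipy)  # the two F-statistics match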
Implementation with Scikit-learn:
Scikit-learn's f_classif function calculates the ANOVA F-value between numerical features and a categorical target. You typically use it with selectors like SelectKBest.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
# Generate synthetic classification data
# 100 samples, 20 features (10 informative, 5 redundant, 5 random)
X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_redundant=5, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
print("Original number of features:", X.shape[1])
# Select the top 10 features based on ANOVA F-value
# Use f_classif for numerical input, classification target
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get the scores and p-values
scores = selector.scores_
p_values = selector.pvalues_
# Get the selected feature names (requires fitting selector on DataFrame)
selected_mask = selector.get_support() # Boolean mask
selected_features = X.columns[selected_mask]
print("Selected number of features:", X_selected.shape[1])
print("Selected features:", selected_features.tolist())
# Example: Display scores for the first 5 features
# print("Scores (first 5):", scores[:5])
# print("P-values (first 5):", p_values[:5])
In this example, SelectKBest(score_func=f_classif, k=10) computes the F-statistic for each of the 20 features against the target y and retains the 10 features with the highest scores (and lowest p-values).
Figure: Example ANOVA F-scores (log scale) calculated by f_classif. Higher scores suggest stronger relationships with the target variable; from such a chart, SelectKBest would pick features like 3, 5, 7, 13, and 19.
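If you prefer to inspect the ranking directly rather than read it off a chart, one option (continuing the example above with the already-fitted selector) is to put the scores into a pandas Series and sort them:
# Rank all 20 features by their ANOVA F-score
score_ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(score_ranking.head(10))  # the 10 features SelectKBest keeps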
When to Use: Use the Chi-Squared (χ²) test when you have categorical input features and a categorical target variable.
Concept: The Chi-Squared test assesses the independence between two categorical variables. For feature selection, it measures the dependency between a categorical feature and the categorical target variable. It compares the observed frequencies (how often each feature category appears within each target class) to the expected frequencies (what you'd expect if the feature and target were independent).
Important Constraint: The Chi-Squared test requires feature values to be non-negative, as it's typically applied to counts or frequencies. This means you often apply it after encoding categorical features using methods like One-Hot Encoding.
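To illustrate what "observed versus expected frequencies" means, here is a small sketch on a toy categorical feature and target (the values are made up for demonstration). It uses scipy.stats.chi2_contingency, which runs the classic contingency-table form of the test; scikit-learn's chi2 scorer applies the same idea to non-negative feature columns, so the exact numbers differ, but the underlying comparison is the same:
import pandas as pd
from scipy.stats import chi2_contingency
# Toy categorical feature and categorical target (illustrative values only)
color = pd.Series(['red', 'red', 'blue', 'blue', 'red', 'blue', 'red', 'blue'])
label = pd.Series(['spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam'])
# Observed frequencies: how often each color appears within each label
observed = pd.crosstab(color, label)
print(observed)
# Compare observed counts to the counts expected if color and label were independent
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print("Chi-squared:", chi2_stat, "p-value:", p_value)
print("Expected frequencies under independence:\n", expected)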
Implementation with Scikit-learn:
Scikit-learn's chi2 function computes the Chi-Squared statistic between non-negative features and a categorical target.
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2
# Generate a new synthetic classification dataset (this replaces the previous X and y)
X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, random_state=50)
X = pd.DataFrame(X, columns=[f'num_feat_{i}' for i in range(5)])
# --- Simulate adding categorical features ---
# For demonstration, let's discretize numerical features to simulate categorical ones
# and also add a truly categorical one. NOTE: In practice, you'd have actual categorical data.
# Create bins for numerical features
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform', subsample=None)
X_binned = pd.DataFrame(binner.fit_transform(X), columns=X.columns, dtype=int)
# Add a couple of purely categorical features
np.random.seed(0)
X_binned['cat_feat_A'] = np.random.choice(['P', 'Q', 'R'], size=X.shape[0])
X_binned['cat_feat_B'] = np.random.choice(['X', 'Y'], size=X.shape[0])
print("Original Features (after binning/adding categoricals):\n", X_binned.head(3))
# One-Hot Encode the binned/categorical features
# Use handle_unknown='ignore' if test data might have unseen categories
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_binned)
encoded_feature_names = encoder.get_feature_names_out(X_binned.columns)
X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_feature_names)
print("\nShape after One-Hot Encoding:", X_encoded_df.shape)
# Select top features using Chi-Squared test
# 'k' should be less than or equal to the number of encoded features
k_best_features = 10
selector_chi2 = SelectKBest(score_func=chi2, k=k_best_features)
X_selected_chi2 = selector_chi2.fit_transform(X_encoded_df, y)
# Get selected feature names
selected_mask_chi2 = selector_chi2.get_support()
selected_features_chi2 = X_encoded_df.columns[selected_mask_chi2]
print(f"\nSelected {k_best_features} features using Chi2:", selected_features_chi2.tolist())
In this setup, we first converted numerical features to categorical-like bins and added explicit categorical features. Then, we applied One-Hot Encoding, resulting in multiple binary columns. Finally, SelectKBest(chi2, k=10) calculated the χ² statistic for each of these non-negative encoded features against the target y and selected the top 10.
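In practice, you might prefer to chain the encoding and the selection steps so they are fitted together, which also keeps them leak-free inside cross-validation. A rough sketch using a Pipeline (reusing X_binned, y, and k_best_features from the code above) could look like this:
from sklearn.pipeline import Pipeline
# Chain One-Hot Encoding and Chi-Squared selection so both are fit in one call
chi2_pipeline = Pipeline(steps=[
    ('encode', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
    ('select', SelectKBest(score_func=chi2, k=k_best_features)),
])
X_selected_pipe = chi2_pipeline.fit_transform(X_binned, y)
print("Shape after pipeline:", X_selected_pipe.shape)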
For reference, the main scoring functions for univariate selection in scikit-learn are:
- f_classif (ANOVA F-value): numerical features, categorical target.
- chi2 (Chi-Squared): categorical features, categorical target. Requires non-negative input, often applied after One-Hot Encoding.
- f_regression: numerical features, numerical target (calculates an F-statistic based on simple linear regression).
- mutual_info_classif and mutual_info_regression: These measure the mutual information between each feature and the target. They are non-parametric and can capture non-linear relationships, making them powerful alternatives, though potentially more computationally intensive (a short sketch follows at the end of this section).
Univariate tests are computationally efficient as they examine each feature independently. However, this is also their main limitation: they do not account for interactions between features. A feature might be individually weak but highly predictive when combined with another. Therefore, while useful for an initial assessment or quick dimensionality reduction, relying solely on univariate tests might lead to discarding potentially valuable features that contribute synergistically. They are often best used as a preliminary step or alongside more sophisticated wrapper or embedded methods.
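As a brief sketch of the mutual-information alternative mentioned above (reusing the numerical X and y generated at the start of the Chi-Squared example):
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Mutual information can pick up non-linear feature/target relationships
mi_selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_mi = mi_selector.fit_transform(X, y)
# Higher mutual information suggests a stronger (possibly non-linear) dependency
mi_scores = pd.Series(mi_selector.scores_, index=X.columns).sort_values(ascending=False)
print(mi_scores)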