Filter methods offer a computationally inexpensive way to prune features before feeding them into a machine learning model. They assess the intrinsic properties of features, often using statistical measures, without considering the predictive model that will eventually be used. One of the simplest filter techniques is based on feature variance.
The core idea behind variance thresholding is straightforward: features that show little change across the dataset are unlikely to be informative. If a feature has the same value for almost all samples, it provides very little information to distinguish between them. In the extreme case, a feature with zero variance is constant across all samples and clearly cannot help a model make predictions.
Removing low-variance features is a basic data cleaning step. By eliminating features that are constant or nearly constant, we can reduce the dimensionality of the dataset, potentially speeding up model training and reducing complexity without significantly impacting performance.
Recall that the variance of a feature X measures the spread of its values around the mean (μ). It's calculated as:
$$\mathrm{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
where N is the number of samples and x_i is the value of the feature for the i-th sample. A variance close to zero indicates that the data points are clustered tightly around the mean.
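As a quick check of the formula, here is a small NumPy sketch (the values mirror feat1 from the example later in this section; note that np.var computes exactly this population variance, with ddof=0, by default):
import numpy as np

x = np.array([0.1, 0.12, 0.09, 0.11, 0.1])   # the feat1 values used below
mu = x.mean()                                # mean of the feature
var_manual = ((x - mu) ** 2).sum() / len(x)  # (1/N) * sum of squared deviations
print(var_manual)                            # ~0.000104
print(np.var(x))                             # same value; np.var defaults to ddof=0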
For binary features (taking values 0 or 1), the population variance is p(1−p), where p is the proportion of samples with the value 1. If p is close to 0 or 1 (meaning the feature is almost always 0 or almost always 1), the variance will be close to 0.
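A quick sanity check of this identity on a mostly-constant binary feature (the array mirrors feat4 from the example that follows):
import numpy as np

b = np.array([1, 1, 1, 1, 0])  # binary feature that is almost always 1
p = b.mean()                   # proportion of ones: 0.8
print(p * (1 - p))             # ~0.16
print(np.var(b))               # ~0.16 as well (population variance, ddof=0)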
Scikit-learn provides the VarianceThreshold transformer in its feature_selection module for this purpose.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
# Sample Data with low-variance features
data = {
'feat1': [0.1, 0.12, 0.09, 0.11, 0.1], # Low variance numerical
'feat2': [10, 12, 9, 110, 50], # High variance numerical
'feat3': [0, 0, 0, 0, 0], # Zero variance (constant)
'feat4': [1, 1, 1, 1, 0], # Low variance binary (mostly 1)
'feat5': [0, 1, 0, 1, 0] # Higher variance binary
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\nVariances:")
print(df.var())
# 1. Remove zero-variance features (default threshold=0)
selector_zero = VarianceThreshold(threshold=0.0)
df_zero_removed = selector_zero.fit_transform(df)
# Get names of selected columns
cols_zero_removed = selector_zero.get_feature_names_out(input_features=df.columns)
df_zero_removed = pd.DataFrame(df_zero_removed, columns=cols_zero_removed)
print("\nDataFrame after removing zero-variance features:")
print(df_zero_removed)
# 2. Remove features with variance below a specific threshold (e.g., 0.1)
# Note: For non-boolean features, consider scaling first if they are on different scales.
# Here, we apply it directly for demonstration.
selector_low = VarianceThreshold(threshold=0.1) # Example threshold
df_low_removed = selector_low.fit_transform(df)
cols_low_removed = selector_low.get_feature_names_out(input_features=df.columns)
df_low_removed = pd.DataFrame(df_low_removed, columns=cols_low_removed)
print("\nDataFrame after removing features with variance < 0.1:")
print(df_low_removed)
Output:
Original DataFrame:
feat1 feat2 feat3 feat4 feat5
0 0.10 10 0 1 0
1 0.12 12 0 1 1
2 0.09 9 0 1 0
3 0.11 110 0 1 1
4 0.10 50 0 0 0
Variances (ddof=0, as used by VarianceThreshold):
feat1       0.000104
feat2    1525.760000
feat3       0.000000
feat4       0.160000
feat5       0.240000
dtype: float64
DataFrame after removing zero-variance features:
feat1 feat2 feat4 feat5
0 0.10 10.0 1.0 0.0
1 0.12 12.0 1.0 1.0
2 0.09 9.0 1.0 0.0
3 0.11 110.0 1.0 1.0
4 0.10 50.0 0.0 0.0
DataFrame after removing features with variance < 0.1:
feat2 feat4 feat5
0 10.0 1.0 0.0
1 12.0 1.0 1.0
2 9.0 1.0 0.0
3 110.0 1.0 1.0
4 50.0 0.0 0.0
In the first step, using the default threshold=0.0, VarianceThreshold identified and removed feat3 because it was constant.

In the second step, with threshold=0.1, it removed feat1 (variance ≈0.0001) and feat3 (variance 0), keeping feat2, feat4, and feat5, as their variances were above 0.1.
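As a side note, if your scikit-learn version is 1.2 or newer, you can avoid rebuilding the DataFrame by hand: the set_output API makes the transformer return pandas output directly, and the fitted variances_ attribute exposes the variances it computed. A minimal sketch, reusing the df from above:
from sklearn.feature_selection import VarianceThreshold

# Requires scikit-learn >= 1.2 for the set_output API
selector = VarianceThreshold(threshold=0.1).set_output(transform="pandas")
df_selected = selector.fit_transform(df)  # returns a DataFrame with column names
print(df_selected.columns.tolist())       # ['feat2', 'feat4', 'feat5']
print(selector.variances_)                # per-feature variances computed during fit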
Variance is scale-dependent, so consider bringing features onto a comparable scale (e.g., with MinMaxScaler) before applying VarianceThreshold with a non-zero threshold. Otherwise, you might unfairly discard features simply because their units result in smaller numerical values. (Note that StandardScaler is a poor fit for this step: it standardizes every feature to unit variance, which would make the threshold meaningless.) The default threshold of 0 (removing constant features) does not require prior scaling.
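A sketch of that workflow, chaining MinMaxScaler and VarianceThreshold in a Pipeline on the df from above (the 0.03 threshold is an arbitrary value chosen for illustration):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

# Scale every feature to [0, 1] so one variance threshold is comparable
# across features, then drop features whose scaled variance is too low.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", VarianceThreshold(threshold=0.03)),  # illustrative threshold
])
X_selected = pipe.fit_transform(df)
kept = pipe.named_steps["select"].get_feature_names_out(input_features=df.columns)
print(kept)  # on this toy data only feat3 is dropped: feat1's raw variance was
             # small only because of its units, and it survives after scaling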
VarianceThreshold only looks at the variance of the feature (X) itself; it doesn't consider any relationship between the feature and the target variable (y). A low-variance feature might, in rare cases, still be predictive. Conversely, a high-variance feature might be completely irrelevant to the target.

Variance thresholding serves as a basic sanity check and preprocessing step, useful for quickly removing obviously uninformative features, especially constant ones. It's often used as a first pass before applying more sophisticated feature selection techniques.