As introduced earlier in this chapter, filter methods evaluate features based on their intrinsic properties, independent of any specific machine learning model. One common technique in this category is correlation analysis, specifically used to identify and potentially remove redundant features.
Redundancy occurs when multiple features convey very similar information. For instance, if you have features for "temperature in Celsius" and "temperature in Fahrenheit," they are perfectly correlated and one is redundant. Including highly correlated features can sometimes destabilize certain models (like linear regression due to multicollinearity) and adds unnecessary complexity without improving predictive power. Correlation analysis helps us spot these relationships.
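As a quick illustration with made-up readings (the column names are hypothetical), the Fahrenheit column below is an exact linear transformation of the Celsius column, so Pandas reports a correlation of 1.0 between them:
import pandas as pd

# Hypothetical temperature readings
temps = pd.DataFrame({"temp_celsius": [10.0, 15.5, 21.0, 26.5, 32.0]})

# Fahrenheit is a linear transformation of Celsius and adds no new information
temps["temp_fahrenheit"] = temps["temp_celsius"] * 9 / 5 + 32

# The off-diagonal entries of the correlation matrix are exactly 1.0
print(temps.corr())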
We typically use the Pearson correlation coefficient, denoted as r, to measure the linear relationship between two numerical features. This coefficient ranges from -1 to +1:
A value of +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship at all. Intermediate values indicate the strength and direction of the association: values close to ±1 signify a strong linear relationship, while values close to 0 suggest a weak or non-existent one.
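For reference, for two features with paired observations x_i and y_i and sample means x̄ and ȳ, the Pearson coefficient is computed as

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$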
In Python, the Pandas library makes it straightforward to compute the pairwise correlation between all numerical columns in a DataFrame using the .corr() method.
import pandas as pd
import numpy as np
# Assume 'df' is your DataFrame with numerical features
# Select only numerical columns if df contains mixed types
numerical_df = df.select_dtypes(include=np.number)
# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()
# Display the matrix (optional)
print(correlation_matrix)
The result is a square matrix where the entry at row i and column j is the correlation coefficient between feature i and feature j. The diagonal elements are always 1 (the correlation of a feature with itself).
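For example, assuming the DataFrame contains columns named feature_a and feature_b (hypothetical names), a single coefficient can be read from the matrix with .loc:
# Look up the correlation between two specific (hypothetical) features
r_ab = correlation_matrix.loc["feature_a", "feature_b"]
print(f"Correlation between feature_a and feature_b: {r_ab:.3f}")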
While the matrix contains the raw numbers, visualizing it as a heatmap often provides a much clearer picture, especially when dealing with many features. We can use libraries like Matplotlib or Seaborn, or generate interactive plots with Plotly.
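For instance, a heatmap can be drawn with Seaborn, assuming it and Matplotlib are installed and correlation_matrix is available from above (a minimal sketch):
import matplotlib.pyplot as plt
import seaborn as sns

# Draw the correlation matrix as a color-coded heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation_matrix,
    annot=True,        # write the coefficient in each cell
    fmt=".2f",         # two decimal places
    cmap="coolwarm",   # red for positive, blue for negative correlations
    vmin=-1,
    vmax=1,            # fix the color scale to the full correlation range
)
plt.title("Feature Correlation Heatmap")
plt.tight_layout()
plt.show()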
Example correlation heatmap visualizing the strength and direction of linear relationships between four features. Red indicates positive correlation, blue indicates negative correlation, and lighter colors indicate weaker correlation.
The core idea is to identify pairs of features with a high absolute correlation (e.g., |r| > 0.8 or |r| > 0.9) and then decide which feature(s) to remove.
Here's a programmatic approach to find and collect features to drop based on a threshold:
import pandas as pd
import numpy as np

# Assume 'correlation_matrix' is calculated as above
# Define the correlation threshold
threshold = 0.9

# Use absolute correlation
abs_corr_matrix = correlation_matrix.abs()

# Keep only the upper triangle of the matrix (excluding the diagonal),
# so each feature pair is considered exactly once
upper_triangle = abs_corr_matrix.where(
    np.triu(np.ones(abs_corr_matrix.shape), k=1).astype(bool)
)

# Collect every column that is highly correlated with at least one feature
# appearing earlier in the column order. Because only the upper triangle is
# inspected, at most one feature of each correlated pair lands in the list.
to_drop = [
    column
    for column in upper_triangle.columns
    if (upper_triangle[column] > threshold).any()
]

print(f"Features to consider dropping (based on threshold {threshold}): {to_drop}")

# Drop the selected features from the original DataFrame
# df_reduced = df.drop(columns=to_drop)
# print(f"Original number of features: {df.shape[1]}")
# print(f"Number of features after correlation filtering: {df_reduced.shape[1]}")
Note: The logic for deciding which feature in a pair to drop needs careful consideration. The example above always drops the feature that appears later in the column order; more refined logic might compare each feature's correlation with the target or other metrics, as sketched below. Working only with the upper triangle avoids redundant checks and self-correlation, and ensures you do not accidentally drop both features of a highly correlated pair.
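As one possible refinement (a sketch only, under stated assumptions), the snippet below assumes a numerical target available as a Series named y, aligned with the DataFrame's index, and keeps from each highly correlated pair the feature that is more strongly correlated with that target; the names y, target_corr, and to_drop_refined are illustrative.
# Sketch: assumes a numerical target Series 'y' aligned with numerical_df
# Absolute correlation of each feature with the (assumed) target
target_corr = numerical_df.corrwith(y).abs()

to_drop_refined = set()
for col in upper_triangle.columns:
    # Rows where this column exceeds the threshold form highly correlated pairs
    for row in upper_triangle.index[upper_triangle[col] > threshold]:
        # Drop whichever member of the pair relates more weakly to the target
        weaker = col if target_corr[col] < target_corr[row] else row
        to_drop_refined.add(weaker)

print(f"Refined drop candidates: {sorted(to_drop_refined)}")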
While useful, correlation analysis as a filter method has limitations:
The Pearson coefficient only captures linear relationships between pairs of numerical features, so strong nonlinear dependencies can go unnoticed and categorical features are not handled at all.
It is computed without reference to the target variable, so the feature you drop from a redundant pair might have carried slightly more predictive information than the one you keep.
It only considers features in pairs, so redundancy that emerges across combinations of three or more features is missed.
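To make the first limitation concrete, here is a small illustrative sketch with synthetic data: a feature and its square are perfectly dependent, yet their Pearson correlation is approximately zero because the relationship is not linear.
import numpy as np

# Synthetic example: x and x**2 are deterministically related, but not linearly
x = np.linspace(-1, 1, 101)
y = x ** 2

# Pearson correlation is ~0 because the relationship is symmetric, not linear
print(np.corrcoef(x, y)[0, 1])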
Despite these limitations, checking for and removing highly correlated numerical features is a standard and often beneficial step in the feature selection process, helping to create simpler, more stable models by reducing feature redundancy.