While simple imputation methods like replacing missing values with the mean, median, or mode are fast and easy to implement, they have a significant limitation: they ignore the relationships between features. If a feature's value is correlated with other features, using a simple statistic might lead to suboptimal or biased results. Multivariate imputation techniques address this by considering the values of other features when estimating the missing data.
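As a concrete illustration of that limitation, the short sketch below (made-up numbers) uses scikit-learn's SimpleImputer with a mean strategy: it fills the gap with the column mean, ignoring the obvious two-to-one relationship between the columns.

import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: the second column is exactly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],  # the relationship suggests 6.0 here
              [4.0, 8.0]])

# Mean imputation fills the gap with the column mean (~4.67) instead
print(SimpleImputer(strategy='mean').fit_transform(X))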
The K-Nearest Neighbors (KNN) Imputer is a popular multivariate approach. Instead of using a simple statistic from the target column alone, it looks at the entire feature set to find data points (samples or rows) that are most similar to the one with the missing value. The missing value is then estimated based on the values observed in these neighboring points.
Imagine you have a dataset of houses with features like square footage, number of bedrooms, year built, and sale price. Suppose the 'year built' is missing for one house. Simple imputation might fill it with the average 'year built' of all houses. KNN Imputation, however, would look for other houses in the dataset that are similar in terms of square footage, number of bedrooms, and sale price (the features without missing values for that row). It identifies the 'k' most similar houses (the nearest neighbors) and then uses their 'year built' values (e.g., by averaging them) to estimate the missing value for the target house.
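The snippet below is a minimal, hand-rolled sketch of this idea using hypothetical house data (the real KNNImputer shown later does all of this for you):

import numpy as np

# Hypothetical houses: [square footage, bedrooms, sale price in $1000s]
features = np.array([
    [1500, 3, 300],
    [1550, 3, 310],
    [2400, 4, 500],
    [1480, 3, 295],
])
year_built = np.array([1995, 1998, 2010, 1994])

# Target house whose 'year built' is missing
target = np.array([1520, 3, 305])

# Find the k=3 most similar houses by Euclidean distance
# (in practice you would scale these features first; see below)
k = 3
distances = np.linalg.norm(features - target, axis=1)
nearest = np.argsort(distances)[:k]

# Impute the missing year as the average of the neighbors' years
imputed_year = year_built[nearest].mean()
print(f"Imputed year built: {imputed_year:.0f}")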
The core idea rests on the assumption that a data point is likely to be similar to its neighbors in the feature space. To find those neighbors, the imputer measures distances between samples using a metric that tolerates missing entries (nan_euclidean_distances in scikit-learn's implementation). With weights='uniform', the imputation is typically the average (for numerical features) of the neighbors' values. With weights='distance', closer neighbors are given more influence, meaning the imputed value is a weighted average where the weights are inversely proportional to the distance.
Scikit-learn provides a convenient KNNImputer class within its impute module.
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler # Often needed before KNNImputer
# Sample data with missing values
data = {'FeatureA': [1, 2, np.nan, 4, 5, 6, 7, 9, 10],
'FeatureB': [2, 4, 6, 8, 10, 11, 12, np.nan, 20],
'FeatureC': [5, 10, 15, 20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# IMPORTANT: KNNImputer is sensitive to feature scaling
# Scale features before applying KNNImputer
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Initialize KNNImputer
# n_neighbors: Number of neighbors to use (k)
# weights: 'uniform' or 'distance'
imputer = KNNImputer(n_neighbors=3, weights='uniform')
# Fit and transform the scaled data
df_imputed_scaled = pd.DataFrame(imputer.fit_transform(df_scaled), columns=df.columns)
# Inverse transform to get data back in original scale
df_imputed = pd.DataFrame(scaler.inverse_transform(df_imputed_scaled), columns=df.columns)
print("\nDataFrame after KNN Imputation:")
print(df_imputed)
# Verify imputed values (e.g., for FeatureA, row 2)
print(f"\nImputed value for FeatureA at index 2: {df_imputed.loc[2, 'FeatureA']:.2f}")
# Verify imputed values (e.g., for FeatureB, row 7)
print(f"\nImputed value for FeatureB at index 7: {df_imputed.loc[7, 'FeatureB']:.2f}")
In this example, n_neighbors=3 means the algorithm finds the 3 closest neighbors (in the scaled feature space) to impute each missing value.
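If you want closer neighbors to count more, you could swap in weights='distance'; the snippet below (reusing df, df_scaled, and scaler from the example above) shows the variation:

# Same data, but each neighbor's vote is weighted by 1/distance
imputer_dist = KNNImputer(n_neighbors=3, weights='distance')
df_imputed_dist = pd.DataFrame(
    scaler.inverse_transform(imputer_dist.fit_transform(df_scaled)),
    columns=df.columns
)
print(df_imputed_dist)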
KNNImputer's most important parameters are:

n_neighbors: The number of neighboring samples (k) used for imputation. A smaller k makes the imputation more sensitive to local patterns but also potentially more susceptible to noise. A larger k provides smoother estimates but might obscure local variations.

weights: Determines how neighbor values contribute. With 'uniform', all k neighbors contribute equally (a simple average). With 'distance', closer neighbors have a stronger influence; each contribution is weighted by the inverse of the distance.

metric: The distance metric used to find neighbors. The default, nan_euclidean, handles missing values appropriately during distance calculation.
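To see what the default metric does with missing entries, you can call nan_euclidean_distances directly (a minimal sketch): distances are computed from the coordinates present in both rows and scaled up to compensate for the skipped ones.

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Pairwise distances using only coordinates present in both rows
print(nan_euclidean_distances(X))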
A few practical considerations apply. Finding neighbors for every incomplete sample can be computationally expensive on large datasets; scikit-learn's IterativeImputer is another multivariate option to consider there. Because the neighbor search is distance-based, scale your features (e.g., with MinMaxScaler or StandardScaler) before applying KNNImputer. Remember to fit the scaler only on the training data and transform both training and test sets, as in the sketch below.
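One way to keep the fit-on-train-only rule straight is to chain the scaler and imputer in a Pipeline (a sketch with made-up train/test data; note the outputs stay in the scaled space):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

preprocess = Pipeline([
    ('scale', MinMaxScaler()),
    ('impute', KNNImputer(n_neighbors=3)),
])

X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
X_test = np.array([[2.5, np.nan]])

# Scaler and imputer are fit on the training data only...
X_train_prep = preprocess.fit_transform(X_train)
# ...and merely applied to the test data
X_test_prep = preprocess.transform(X_test)
print(X_train_prep)
print(X_test_prep)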
Finally, KNNImputer expects numerical input. Categorical features need to be appropriately encoded (e.g., one-hot encoding) before applying the imputer; however, one-hot encoding can significantly increase dimensionality, potentially impacting performance.

KNN Imputer offers a more sophisticated way to handle missing data compared to simple strategies by leveraging inter-feature relationships. However, its computational cost and its sensitivity to scaling and to the choice of k mean it should be applied thoughtfully, particularly after ensuring features are appropriately preprocessed (scaled). It often provides a good balance between imputation quality and complexity when simple methods are insufficient.