While simple imputation methods like replacing missing values with the mean, median, or mode are fast and easy to implement, they have a significant limitation: they ignore the relationships between features. If a feature's value is correlated with other features, using a simple statistic might lead to suboptimal or biased results. Multivariate imputation techniques address this by considering the values of other features when estimating the missing data.
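As a concrete illustration of that limitation, the short sketch below (made-up numbers) uses scikit-learn's SimpleImputer with a mean strategy: it fills the gap with the column mean, ignoring the obvious two-to-one relationship between the columns.

import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: the second column is exactly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],  # the relationship suggests 6.0 here
              [4.0, 8.0]])

# Mean imputation fills the gap with the column mean (~4.67) instead
print(SimpleImputer(strategy='mean').fit_transform(X))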
The K-Nearest Neighbors (KNN) Imputer is a popular multivariate approach. Instead of using a simple statistic from the target column alone, it looks at the entire feature set to find data points (samples or rows) that are most similar to the one with the missing value. The missing value is then estimated based on the values observed in these neighboring points.
Imagine you have a dataset of houses with features like square footage, number of bedrooms, year built, and sale price. Suppose the 'year built' is missing for one house. Simple imputation might fill it with the average 'year built' of all houses. KNN Imputation, however, would look for other houses in the dataset that are similar in terms of square footage, number of bedrooms, and sale price (the features without missing values for that row). It identifies the 'k' most similar houses (the nearest neighbors) and then uses their 'year built' values (e.g., by averaging them) to estimate the missing value for the target house.
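The snippet below is a minimal, hand-rolled sketch of this idea using hypothetical house data (the real KNNImputer shown later does all of this for you):

import numpy as np

# Hypothetical houses: [square footage, bedrooms, sale price in $1000s]
features = np.array([
    [1500, 3, 300],
    [1550, 3, 310],
    [2400, 4, 500],
    [1480, 3, 295],
])
year_built = np.array([1995, 1998, 2010, 1994])

# Target house whose 'year built' is missing
target = np.array([1520, 3, 305])

# Find the k=3 most similar houses by Euclidean distance
# (in practice you would scale these features first; see below)
k = 3
distances = np.linalg.norm(features - target, axis=1)
nearest = np.argsort(distances)[:k]

# Impute the missing year as the average of the neighbors' years
imputed_year = year_built[nearest].mean()
print(f"Imputed year built: {imputed_year:.0f}")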
The core idea rests on the assumption that a data point is likely to be similar to its neighbors in the feature space. To find those neighbors, the imputer measures distances between samples using a metric that tolerates missing entries (nan_euclidean_distances in scikit-learn's implementation). With weights='uniform', the imputation is typically the average (for numerical features) of the neighbors' values. With weights='distance', closer neighbors are given more influence, meaning the imputed value is a weighted average where the weights are inversely proportional to the distance.
Scikit-learn provides a convenient KNNImputer class within its impute module.
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler # Often needed before KNNImputer
# Sample data with missing values
data = {'FeatureA': [1, 2, np.nan, 4, 5, 6, 7, 9, 10],
'FeatureB': [2, 4, 6, 8, 10, 11, 12, np.nan, 20],
'FeatureC': [5, 10, 15, 20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# IMPORTANT: KNNImputer is sensitive to feature scaling
# Scale features before applying KNNImputer
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Initialize KNNImputer
# n_neighbors: Number of neighbors to use (k)
# weights: 'uniform' or 'distance'
imputer = KNNImputer(n_neighbors=3, weights='uniform')
# Fit and transform the scaled data
df_imputed_scaled = pd.DataFrame(imputer.fit_transform(df_scaled), columns=df.columns)
# Inverse transform to get data back in original scale
df_imputed = pd.DataFrame(scaler.inverse_transform(df_imputed_scaled), columns=df.columns)
print("\nDataFrame after KNN Imputation:")
print(df_imputed)
# Verify imputed values (e.g., for FeatureA, row 2)
print(f"\nImputed value for FeatureA at index 2: {df_imputed.loc[2, 'FeatureA']:.2f}")
# Verify imputed values (e.g., for FeatureB, row 7)
print(f"\nImputed value for FeatureB at index 7: {df_imputed.loc[7, 'FeatureB']:.2f}")
In this example, n_neighbors=3 means the algorithm finds the 3 closest neighbors (in the scaled feature space) to impute each missing value.
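If you want closer neighbors to count more, you could swap in weights='distance'; the snippet below (reusing df, df_scaled, and scaler from the example above) shows the variation:

# Same data, but each neighbor's vote is weighted by 1/distance
imputer_dist = KNNImputer(n_neighbors=3, weights='distance')
df_imputed_dist = pd.DataFrame(
    scaler.inverse_transform(imputer_dist.fit_transform(df_scaled)),
    columns=df.columns
)
print(df_imputed_dist)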
KNNImputer's most important parameters are:

n_neighbors: The number of neighboring samples (k) used for imputation. A smaller k makes the imputation more sensitive to local patterns but also potentially more susceptible to noise. A larger k provides smoother estimates but might obscure local variations.

weights: Determines how neighbor values contribute. With 'uniform', all k neighbors contribute equally (a simple average). With 'distance', closer neighbors have a stronger influence; each contribution is weighted by the inverse of the distance.

metric: The distance metric used to find neighbors. The default, nan_euclidean, handles missing values appropriately during distance calculation.
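To see what the default metric does with missing entries, you can call nan_euclidean_distances directly (a minimal sketch): distances are computed from the coordinates present in both rows and scaled up to compensate for the skipped ones.

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Pairwise distances using only coordinates present in both rows
print(nan_euclidean_distances(X))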
A few practical considerations apply. Finding neighbors for every incomplete sample can be computationally expensive on large datasets; scikit-learn's IterativeImputer is another multivariate option to consider there. Because the neighbor search is distance-based, scale your features (e.g., with MinMaxScaler or StandardScaler) before applying KNNImputer. Remember to fit the scaler only on the training data and transform both training and test sets, as in the sketch below.
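One way to keep the fit-on-train-only rule straight is to chain the scaler and imputer in a Pipeline (a sketch with made-up train/test data; note the outputs stay in the scaled space):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

preprocess = Pipeline([
    ('scale', MinMaxScaler()),
    ('impute', KNNImputer(n_neighbors=3)),
])

X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
X_test = np.array([[2.5, np.nan]])

# Scaler and imputer are fit on the training data only...
X_train_prep = preprocess.fit_transform(X_train)
# ...and merely applied to the test data
X_test_prep = preprocess.transform(X_test)
print(X_train_prep)
print(X_test_prep)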
Finally, KNNImputer expects numerical input. Categorical features need to be appropriately encoded (e.g., one-hot encoding) before applying the imputer; however, one-hot encoding can significantly increase dimensionality, potentially impacting performance.

KNN Imputer offers a more sophisticated way to handle missing data compared to simple strategies by leveraging inter-feature relationships. However, its computational cost and its sensitivity to scaling and to the choice of k mean it should be applied thoughtfully, particularly after ensuring features are appropriately preprocessed (scaled). It often provides a good balance between imputation quality and complexity when simple methods are insufficient.