Once you've identified missing values in your dataset, the next step is deciding how to handle them. As discussed, most machine learning algorithms require complete datasets. While deleting rows or columns with missing data is an option (listwise or pairwise deletion), it often leads to significant data loss, especially if missing values are widespread. Imputation, the process of filling in missing values with estimated ones, offers a way to retain data points.
The simplest imputation techniques operate on a single variable at a time, using summary statistics derived from the observed values of that variable. These univariate methods, specifically mean, median, and mode imputation, are easy to understand and implement, making them a common first approach.
Mean imputation replaces all missing values (NaN) within a numerical column with the arithmetic mean of the non-missing values in that same column.
Let's say we have a feature vector $x_j$ with some missing entries. The mean $\bar{x}_j$ is calculated using only the observed values. For each missing entry $x_{ij}$ (where $i$ is the sample index and $j$ is the feature index), we substitute:

$$x_{ij} \leftarrow \bar{x}_j$$
Implementation with Pandas:
You can easily perform mean imputation using Pandas. Assume df is your DataFrame and 'Age' is a column with missing values:
import pandas as pd
import numpy as np
# Sample DataFrame
data = {'Age': [25, 30, np.nan, 35, 40, np.nan, 55],
        'Salary': [50000, 60000, 75000, np.nan, 80000, 95000, 120000]}
df = pd.DataFrame(data)
# Calculate the mean of the 'Age' column (excluding NaNs)
mean_age = df['Age'].mean()
print(f"Mean Age: {mean_age:.2f}")
# Impute missing values in 'Age' with the mean
df['Age_mean_imputed'] = df['Age'].fillna(mean_age)
print("\nDataFrame after Mean Imputation for Age:")
print(df)
Pros:
- Very simple and fast to compute.
- Retains all rows, so no data is lost.
- Preserves the mean of the observed values.
Cons:
- Sensitive to outliers, which can pull the mean away from typical values.
- Reduces the variance of the feature and distorts its distribution.
- Ignores relationships with other features.
Mean imputation is generally considered when the amount of missing data is small and the variable has a somewhat symmetrical distribution without significant outliers.
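To see why outliers matter, consider a small sketch with illustrative values: a single extreme entry drags the mean far above typical observations, so every imputed value inherits that distortion while the median stays near the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical income column with one extreme outlier
income = pd.Series([40_000, 45_000, 50_000, np.nan, 48_000, 1_000_000])

print(income.mean())    # 236600.0 - pulled far above typical values
print(income.median())  # 48000.0 - stays near the bulk of the data

# Mean imputation fills the gap with the distorted value
imputed = income.fillna(income.mean())
```

Here the imputed entry (236,600) is nearly five times any typical income in the column, purely because of one outlier.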
Median imputation is similar to mean imputation, but instead of using the mean, it uses the median (the middle value when the data is sorted) of the observed values in the column to replace missing entries.
For a feature $x_j$, the median $\mathrm{median}(x_j)$ is calculated from the observed values. Missing entries $x_{ij}$ are substituted with:

$$x_{ij} \leftarrow \mathrm{median}(x_j)$$
Implementation with Pandas:
# Calculate the median of the 'Age' column
median_age = df['Age'].median()
print(f"\nMedian Age: {median_age}")
# Impute missing values in 'Age' with the median
df['Age_median_imputed'] = df['Age'].fillna(median_age)
print("\nDataFrame after Median Imputation for Age:")
print(df)
Pros:
- Robust to outliers and skewed distributions.
- As simple and fast as mean imputation.
Cons:
- Still reduces the variance of the feature and distorts its distribution.
- Ignores relationships with other features.
Median imputation is often preferred over mean imputation for numerical features, especially when dealing with skewed data or potential outliers.
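A brief sketch with made-up, right-skewed values (think wait times with a long tail) shows the difference in practice: the mean is pulled toward the tail, while the median reflects a typical observation.

```python
import numpy as np
import pandas as pd

# Right-skewed sample; values are illustrative
wait = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 25.0, np.nan])

print(wait.mean())    # ~6.17 - pulled upward by the long right tail
print(wait.median())  # 2.5   - closer to a "typical" observation

mean_fill = wait.fillna(wait.mean())
median_fill = wait.fillna(wait.median())
```

For this sample, median imputation fills the gap with 2.5, while mean imputation would insert roughly 6.17, a value larger than five of the six observed points.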
Mean and median are suitable for numerical data, but not for categorical features. For categorical data (or sometimes discrete numerical data with few unique values), mode imputation is used. It replaces missing values with the mode, which is the most frequently occurring value in the column.
For a feature $x_j$, the mode $\mathrm{mode}(x_j)$ is determined from the observed values. Missing entries $x_{ij}$ are substituted with:

$$x_{ij} \leftarrow \mathrm{mode}(x_j)$$
Implementation with Pandas:
Pandas' .mode() method returns a Series (as there might be multiple modes). We typically select the first one using [0].
# Sample DataFrame with a categorical feature
data_cat = {'Color': ['Red', 'Blue', 'Green', np.nan, 'Blue', 'Red', 'Blue', np.nan],
            'Size': ['M', 'L', 'S', 'M', np.nan, 'L', 'M', 'S']}
df_cat = pd.DataFrame(data_cat)
# Calculate the mode of the 'Color' column
mode_color = df_cat['Color'].mode()[0]
print(f"\nMode Color: {mode_color}")
# Impute missing values in 'Color' with the mode
df_cat['Color_mode_imputed'] = df_cat['Color'].fillna(mode_color)
# Calculate the mode of the 'Size' column
mode_size = df_cat['Size'].mode()[0]
print(f"Mode Size: {mode_size}")
# Impute missing values in 'Size' with the mode
df_cat['Size_mode_imputed'] = df_cat['Size'].fillna(mode_size)
print("\nDataFrame after Mode Imputation:")
print(df_cat)
Pros:
- Works for categorical features (and discrete numerical features with few unique values).
- Simple and fast.
Cons:
- Over-represents the majority category, distorting class frequencies.
- The choice is arbitrary when several values tie for the mode.
- Ignores relationships with other features.
Mode imputation is the standard simple technique for categorical features.
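When several values tie for most frequent, .mode() returns all of them in sorted order, and indexing with [0] silently picks the first. A tiny illustration:

```python
import pandas as pd

# 'L' and 'S' each appear twice, so both are modes
s = pd.Series(['S', 'S', 'L', 'L', 'M'])

print(s.mode())     # returns both modes, sorted: 'L', 'S'
print(s.mode()[0])  # [0] picks the first, here 'L'
```

This tie-breaking is arbitrary from a modeling standpoint, so it is worth checking whether your column actually has a unique mode before relying on it.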
While Pandas .fillna() is convenient for quick imputation, Scikit-learn provides the SimpleImputer class, which is particularly useful when integrating imputation into a machine learning pipeline. It ensures that the imputation strategy learned from the training data is consistently applied to any new data (like a test set), preventing data leakage. SimpleImputer supports mean, median, mode ('most_frequent'), and constant value imputation.
from sklearn.impute import SimpleImputer
# --- Mean Imputation using SimpleImputer ---
# Reshape needed as SimpleImputer expects 2D array-like input
age_column = df[['Age']] # Select 'Age' column, keeping it as a DataFrame
imputer_mean = SimpleImputer(strategy='mean')
# Fit the imputer on the data (learns the mean)
imputer_mean.fit(age_column)
# Transform the data (apply imputation)
df['Age_sklearn_mean'] = imputer_mean.transform(age_column)
# --- Median Imputation using SimpleImputer ---
imputer_median = SimpleImputer(strategy='median')
imputer_median.fit(age_column) # Fit learns the median
df['Age_sklearn_median'] = imputer_median.transform(age_column) # Transform applies it
# --- Mode Imputation using SimpleImputer ---
# strategy='most_frequent' works directly on string (object) columns,
# so no numerical encoding is needed for simple mode imputation
color_column = df_cat[['Color']]
imputer_mode = SimpleImputer(strategy='most_frequent')
imputer_mode.fit(color_column)  # Learns 'Blue' is the mode
df_cat['Color_sklearn_mode'] = imputer_mode.transform(color_column)
print("\nDataFrame after Scikit-learn Mean/Median Imputation (Original DF):")
print(df[['Age', 'Age_sklearn_mean', 'Age_sklearn_median']].head())
print("\nDataFrame after Scikit-learn Mode Imputation (Categorical DF):")
print(df_cat[['Color', 'Color_sklearn_mode']].head())
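SimpleImputer's fourth strategy, 'constant', fills every missing entry with a fixed value you supply via fill_value. A minimal sketch (the 'Unknown' label here is an illustrative choice, not a library default for strings):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_demo = pd.DataFrame({'Color': ['Red', np.nan, 'Blue']})

# Fill missing categories with an explicit 'Unknown' label
imputer_const = SimpleImputer(strategy='constant', fill_value='Unknown')
filled = imputer_const.fit_transform(df_demo[['Color']])
print(filled.ravel())  # ['Red' 'Unknown' 'Blue']
```

Constant imputation is useful when "missing" is itself informative: the model can then treat the 'Unknown' category as its own signal rather than blending missing rows into the majority class.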
Using SimpleImputer within a Pipeline object (covered later in discussions about ML workflows) is standard practice for robust model building.
Mean, median, and mode imputation are fast and easy baseline methods. However, their major drawback is that they are univariate: they only consider the information within the column containing the missing value, ignoring potential relationships or correlations with other features. This often leads to:
- Reduced variance in the imputed feature.
- A distorted distribution, with an artificial spike at the imputed value.
- Weakened correlations between the imputed feature and other features.
- Underestimated uncertainty in downstream statistical analyses.
Consider a dataset where height and weight are correlated. Imputing missing weights using only the mean weight ignores the fact that taller people tend to weigh more. A simple mean imputation might assign an average weight to a very tall person, which is likely inaccurate.
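This effect is easy to demonstrate on synthetic data (the numbers below are generated for illustration): after mean-imputing a correlated feature, its measured correlation with the other feature shrinks, because the imputed rows contribute a constant value that carries no information about height.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)
weight = 0.9 * height + rng.normal(0, 5, 200)  # weight correlated with height
df_hw = pd.DataFrame({'height': height, 'weight': weight})

# Knock out ~30% of the weights, then mean-impute them
mask = rng.random(200) < 0.3
df_hw.loc[mask, 'weight'] = np.nan
df_hw['weight_imputed'] = df_hw['weight'].fillna(df_hw['weight'].mean())

c_obs = df_hw['height'].corr(df_hw['weight'])          # observed pairs only
c_imp = df_hw['height'].corr(df_hw['weight_imputed'])  # after mean imputation
print(c_obs, c_imp)  # the imputed correlation is noticeably weaker
```

Multivariate methods such as KNN or iterative imputation avoid much of this shrinkage by using height to inform each imputed weight.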
The following plot illustrates how mean imputation can distort the distribution of the 'Age' feature from our earlier example.
Distribution of the 'Age' feature before and after mean imputation. Notice the spike at the mean value (37.0) in the imputed data, altering the original distribution shape.
While simple imputation methods provide a quick fix, they often fail to capture the underlying structure of the data. More sophisticated techniques, such as KNN Imputation and Iterative Imputation, leverage information from other features to make more informed estimates, which we will explore next.
© 2025 ApX Machine Learning