Real-world datasets are rarely complete. Missing values, often represented as NaN (Not a Number), None, or other placeholders, are common due to data entry errors, sensor malfunctions, or respondents skipping questions. Most machine learning algorithms in Scikit-learn cannot handle missing values directly and will raise an error if they encounter them during training or prediction. Therefore, addressing missing data is a necessary step in the preprocessing pipeline.
There are several ways to handle missing values, each with its own advantages and disadvantages.
Before deciding on a strategy, you first need to identify where the missing values are. Pandas DataFrames provide convenient methods for this:
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {'Age': [25, 30, np.nan, 35, 40],
'Salary': [50000, 60000, 75000, np.nan, 90000],
'City': ['New York', 'London', 'Paris', 'Tokyo', np.nan]}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
# Check for missing values (returns boolean DataFrame)
print("\nMissing values check:")
print(df.isnull())
# Get count of missing values per column
print("\nMissing values count per column:")
print(df.isnull().sum())
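For this sample DataFrame, the count check reports exactly one missing value per column:

Age       1
Salary    1
City      1
dtype: int64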
This output helps quantify the extent of the missing data problem in each feature.
One straightforward approach is to simply remove rows or columns that contain missing values. You can drop rows with df.dropna() and drop entire columns with df.drop() (or df.dropna(axis=1)). However, this should be done cautiously: dropping rows discards the valid information they contain in other features, and even columns with many missing values might hold some predictive power. Deletion is generally not the preferred method unless the amount of missing data is very small or a column is clearly unusable.
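A minimal sketch of both deletion options, reusing the sample df from above:

# Drop every row that contains at least one missing value
# (only rows 0 and 1 survive in this sample)
df_rows_dropped = df.dropna()

# Drop every column that contains at least one missing value
# (here each column has a NaN, so all three would be removed)
df_cols_dropped = df.dropna(axis=1)

# Drop a specific, clearly unusable column by name
df_without_city = df.drop(columns=['City'])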
Imputation involves filling in missing values with substitute estimates. This preserves the dataset's size but requires careful consideration of the imputation method to avoid introducing bias or distorting the data's distribution. Scikit-learn provides the SimpleImputer class within its impute module for basic imputation tasks. SimpleImputer works like other Scikit-learn transformers, following the familiar fit and transform pattern.
SimpleImputer supports four strategies:
- mean: replaces missing values in a numerical column with the mean of the observed values in that same column.
- median: replaces missing values in a numerical column with the median of the observed values, which is more robust to outliers.
- most_frequent: replaces missing values with the most frequent value (mode) in the column; this works for both numerical and categorical data.
- constant: replaces missing values with a fixed value specified by the user via fill_value (e.g., 0, -1, or "Missing").
Let's see how to apply these strategies using SimpleImputer. We'll use the sample DataFrame created earlier.
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
# Sample DataFrame (same as before)
data = {'Age': [25, 30, np.nan, 35, 40],
'Salary': [50000, 60000, 75000, np.nan, 90000],
'City': ['New York', 'London', 'Paris', 'Tokyo', np.nan]}
df = pd.DataFrame(data)
# Separate numerical and categorical columns for imputation
df_numeric = df[['Age', 'Salary']]
df_categorical = df[['City']]
# --- Mean Imputation (for numerical) ---
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on the data to learn the means
mean_imputer.fit(df_numeric)
# Transform the data (replace NaNs)
df_numeric_mean_imputed = mean_imputer.transform(df_numeric)
print("\nMean Imputed Numerical Data:")
# Convert back to DataFrame for clarity
print(pd.DataFrame(df_numeric_mean_imputed, columns=df_numeric.columns))
# --- Median Imputation (for numerical) ---
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
median_imputer.fit(df_numeric)
df_numeric_median_imputed = median_imputer.transform(df_numeric)
print("\nMedian Imputed Numerical Data:")
print(pd.DataFrame(df_numeric_median_imputed, columns=df_numeric.columns))
# --- Most Frequent Imputation (for categorical) ---
# Note: the 'most_frequent' and 'constant' strategies also accept
# string/object data, so we can apply the imputer to 'City' directly.
# In this sample every city appears exactly once; when values tie,
# SimpleImputer returns the smallest, so 'London' fills the missing entry.
most_frequent_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
most_frequent_imputer.fit(df_categorical)
df_categorical_mf_imputed = most_frequent_imputer.transform(df_categorical)
print("\nMost Frequent Imputed Categorical Data:")
print(pd.DataFrame(df_categorical_mf_imputed, columns=df_categorical.columns))
# --- Constant Imputation (e.g., for numerical) ---
constant_imputer_num = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
constant_imputer_num.fit(df_numeric)
df_numeric_const_imputed = constant_imputer_num.transform(df_numeric)
print("\nConstant (0) Imputed Numerical Data:")
print(pd.DataFrame(df_numeric_const_imputed, columns=df_numeric.columns))
# --- Constant Imputation (e.g., for categorical) ---
constant_imputer_cat = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='Unknown')
constant_imputer_cat.fit(df_categorical)
df_categorical_const_imputed = constant_imputer_cat.transform(df_categorical)
print("\nConstant ('Unknown') Imputed Categorical Data:")
print(pd.DataFrame(df_categorical_const_imputed, columns=df_categorical.columns))
Important Note on fit and transform: Just like scalers and encoders, imputers must be fitted only on the training data. The statistics learned from the training data (e.g., mean, median, mode) are then used to transform both the training and the test data. This prevents data leakage, where information from the test set inadvertently influences the preprocessing steps. This is handled automatically when using imputers within a Scikit-learn Pipeline, which we will cover later.
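As an illustrative sketch of this pattern (the data and split below are hypothetical), the imputer learns its statistic from the training portion only and reuses it on the test portion:

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Hypothetical numerical feature with missing entries
X = pd.DataFrame({'Age': [25, 30, np.nan, 35, 40, np.nan, 28, 33]})
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                        # learn the mean from training data only
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)  # reuse the training mean: no leakage
print(imputer.statistics_)                  # the mean learned from the training split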
Choosing the right imputation strategy depends on the nature of the data (numerical vs. categorical), the distribution of the feature, the percentage of missing values, and the requirements of the machine learning algorithm you plan to use. While SimpleImputer covers basic techniques, Scikit-learn also offers more sophisticated imputers such as KNNImputer (which uses K-Nearest Neighbors to estimate values) and IterativeImputer (which models each feature with missing values as a function of the other features) for potentially more accurate results, though these are computationally more expensive.
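As a brief sketch of how these alternatives are invoked on the numerical sample data (note that IterativeImputer is experimental and must be enabled explicitly before import):

from sklearn.impute import KNNImputer
import numpy as np

X = [[25, 50000], [30, 60000], [np.nan, 75000], [35, np.nan], [40, 90000]]

# Estimate each missing value from the 2 nearest rows, using distances
# computed over the features observed in both rows. In practice, scale
# features first so no single feature dominates the distance.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

# IterativeImputer models each feature with missing values
# as a function of the other features
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

iterative_imputer = IterativeImputer(random_state=0)
X_iter = iterative_imputer.fit_transform(X)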