As mentioned earlier, real-world data is often incomplete, containing missing entries typically represented as NaN (Not a Number) in NumPy arrays or Pandas DataFrames. Many Scikit-learn algorithms cannot process datasets with missing values, so we need a strategy to handle them. Simply dropping rows or columns with missing data might discard valuable information, especially if the dataset isn't very large. A common alternative is imputation: replacing missing values with substitute values derived from the non-missing data.
Scikit-learn provides a convenient transformer for basic imputation tasks: SimpleImputer. Located in the sklearn.impute module, it allows you to fill missing values using various strategies.
SimpleImputer
The SimpleImputer follows the standard Scikit-learn transformer API, meaning it has fit and transform methods.

- fit(X): During the fit step, the imputer calculates the statistic (e.g., mean, median, mode) from the non-missing values in each column of the training data X. This learned statistic is stored internally.
- transform(X): The transform step uses the statistics learned during fit to fill the missing values (marked as np.nan or another specified marker) in the input data X.

It's important to fit the imputer only on the training data and then use it to transform both the training and the test data. This prevents data leakage, ensuring that information from the test set doesn't leak into the statistics used for imputation.
SimpleImputer supports several imputation strategies via the strategy parameter:

- mean: Replaces missing values using the mean along each column. Suitable only for numerical data. Can be sensitive to outliers.
- median: Replaces missing values using the median along each column. Also only for numerical data. Generally more robust to outliers than the mean.
- most_frequent: Replaces missing values using the mode (the most frequent value) along each column. Can be used with both numerical and categorical data.
- constant: Replaces missing values with a fixed value specified by the fill_value parameter. This can be useful if missing values have a specific meaning or if you want to use a placeholder like 0 or -1 for numerical data, or "Missing" for categorical data.

Let's see how to apply SimpleImputer with the median strategy.
import numpy as np
from sklearn.impute import SimpleImputer
# Sample data with missing values
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])
print("Original Data:")
print(X)
# Initialize SimpleImputer with the 'median' strategy
imputer = SimpleImputer(strategy='median')
# Fit the imputer on the data (learns the median of each column)
# Column 0: median of [1, 4, 7] -> 4.0
# Column 1: median of [2, 8, 11] -> 8.0
# Column 2: median of [6, 9, 12] -> 9.0
imputer.fit(X)
# Transform the data (fill missing values)
X_imputed = imputer.transform(X)
print("\nImputed Data (Median Strategy):")
print(X_imputed)
# We can also use fit_transform as a shortcut
imputer_mean = SimpleImputer(strategy='mean')
X_imputed_mean = imputer_mean.fit_transform(X)
print("\nImputed Data (Mean Strategy):")
print(X_imputed_mean)
Running this code will first print the original array containing NaN values. Then, it will show the array after applying median imputation, where the NaNs are replaced by the respective column medians (4.0, 8.0, 9.0). Finally, it shows the result using the mean strategy.
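If you want to inspect the values the imputer learned, they are stored in its statistics_ attribute after fitting. A minimal sketch using the same sample array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])

imputer = SimpleImputer(strategy='median')
imputer.fit(X)

# The per-column medians learned during fit are exposed as statistics_
print(imputer.statistics_)  # -> [4. 8. 9.]
```

Checking statistics_ is a quick way to verify that the fitted values match what you expect before transforming any data.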
Sometimes, the fact that a value was missing might itself be useful information for a machine learning model. SimpleImputer allows you to keep track of which values were imputed by setting the add_indicator parameter to True. When set, the transformer appends binary indicator columns to the output. Each indicator column corresponds to a column in the input that had missing values, with a 1 marking entries that were originally missing and 0 otherwise.
# Sample data again
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])
# Initialize SimpleImputer with median strategy and indicator
indicator_imputer = SimpleImputer(strategy='median', add_indicator=True)
# Fit and transform
X_imputed_indicated = indicator_imputer.fit_transform(X)
print("\nImputed Data with Missing Value Indicators:")
print(X_imputed_indicated)
# Note the shape difference
print(f"\nOriginal shape: {X.shape}")
print(f"Shape after imputation with indicator: {X_imputed_indicated.shape}")
The output X_imputed_indicated will have the imputed values in the first three columns (same as X_imputed in the previous example) and three additional binary columns. The fourth column will have a 1 in row 3 (indicating X[3, 0] was missing). The fifth column will have a 1 in row 1 (indicating X[1, 1] was missing). The sixth column will have a 1 in row 0 (indicating X[0, 2] was missing).
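One way to convince yourself of this layout is to compare the indicator columns against the original missingness mask; the appended columns should reproduce np.isnan(X) exactly. A quick sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])

imp = SimpleImputer(strategy='median', add_indicator=True)
X_out = imp.fit_transform(X)

# The last three columns encode where values were originally missing
indicator_cols = X_out[:, 3:]
print(indicator_cols)

# They match the NaN mask of the input, column for column
assert np.array_equal(indicator_cols.astype(bool), np.isnan(X))
```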
While imputation is a standard technique, it's not without drawbacks. It makes assumptions about the data (e.g., that the median is a reasonable substitute) and can potentially distort relationships between variables or reduce variance. The choice of strategy (mean, median, most_frequent, constant) should be guided by the nature of your data and the algorithm you intend to use. For instance, median is often preferred over mean for numerical data with outliers. most_frequent or constant (with a meaningful fill_value) are necessary for categorical features if using SimpleImputer.
In practice, you often need to apply different imputation strategies to different columns (e.g., median for numerical, most frequent for categorical). This is typically handled using Scikit-learn's ColumnTransformer, which allows applying different transformers to different subsets of columns, often within a larger Pipeline. We will cover ColumnTransformer and Pipelines in detail in Chapter 6. For now, understanding how SimpleImputer works provides a solid foundation for handling missing data.
© 2025 ApX Machine Learning