Data is often incomplete, containing missing entries represented typically as NaN (Not a Number) in NumPy arrays or Pandas DataFrames. Many Scikit-learn algorithms cannot process datasets with missing values, making a strategy to handle them essential. Simply dropping rows or columns with missing data might discard valuable information, especially if the dataset isn't very large. A common alternative is imputation: replacing missing values with substitute values derived from the non-missing data.
Scikit-learn provides a convenient transformer for basic imputation tasks: SimpleImputer. Located in the sklearn.impute module, it allows you to fill missing values using various strategies.
The SimpleImputer follows the standard Scikit-learn transformer API, meaning it has fit and transform methods.
fit(X): During the fit step, the imputer calculates the statistic (e.g., mean, median, mode) from the non-missing values in each column of the training data X. This learned statistic is stored internally.

transform(X): The transform step uses the statistics learned during fit to fill the missing values (marked as np.nan or another specified marker) in the input data X.

It's important to fit the imputer only on the training data and then use it to transform both the training and the test data. This prevents data leakage, ensuring that information from the test set doesn't influence the values used to fill the training data.
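The fit-on-train, transform-both pattern can be sketched as follows (the small X_train and X_test arrays are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test arrays with missing entries
X_train = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [5.0, 6.0]])
X_test = np.array([[np.nan, 8.0],
                   [9.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)  # statistics computed from X_train only

X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)  # reuses the training statistics

# Column means learned from X_train: [3.0, 5.0]
print(imputer.statistics_)
print(X_test_filled)
```

Note that the NaN values in X_test are filled with the means of the training columns (3.0 and 5.0), not with statistics computed from the test set itself.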
SimpleImputer supports several imputation strategies via the strategy parameter:
mean: Replaces missing values using the mean along each column. Suitable only for numerical data. Can be sensitive to outliers.

median: Replaces missing values using the median along each column. Also only for numerical data. Generally more robust to outliers than the mean.

most_frequent: Replaces missing values using the mode (the most frequent value) along each column. Can be used with both numerical and categorical data.

constant: Replaces missing values with a fixed value specified by the fill_value parameter. This can be useful if missing values have a specific meaning or if you want to use a placeholder such as 0 or -1 for numerical data, or "Missing" for categorical data.

Let's see how to apply SimpleImputer with the median strategy.
import numpy as np
from sklearn.impute import SimpleImputer
# Sample data with missing values
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])
print("Original Data:")
print(X)
# Initialize SimpleImputer with the 'median' strategy
imputer = SimpleImputer(strategy='median')
# Fit the imputer on the data (learns the medians of each column)
# Column 0 median: median of [1, 4, 7] -> 4.0
# Column 1 median: median of [2, 8, 11] -> 8.0
# Column 2 median: median of [6, 9, 12] -> 9.0
imputer.fit(X)
# Transform the data (fill missing values)
X_imputed = imputer.transform(X)
print("\nImputed Data (Median Strategy):")
print(X_imputed)
# We can also use fit_transform as a shortcut
imputer_mean = SimpleImputer(strategy='mean')
X_imputed_mean = imputer_mean.fit_transform(X)
print("\nImputed Data (Mean Strategy):")
print(X_imputed_mean)
Running this code will first print the original array containing NaN values. Then, it will show the array after applying median imputation, where NaNs are replaced by the respective column medians (4.0, 8.0, 9.0). Finally, it shows the result using the mean strategy.
Sometimes, the fact that a value was missing might itself be useful information for a machine learning model. SimpleImputer allows you to keep track of which values were imputed by setting the add_indicator parameter to True. When set, the transformer appends binary indicator columns to the output. Each indicator column corresponds to a column in the input that had missing values, with a 1 marking entries that were originally missing and 0 otherwise.
# Sample data again
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])
# Initialize SimpleImputer with median strategy and indicator
indicator_imputer = SimpleImputer(strategy='median', add_indicator=True)
# Fit and transform
X_imputed_indicated = indicator_imputer.fit_transform(X)
print("\nImputed Data with Missing Value Indicators:")
print(X_imputed_indicated)
# Note the shape difference
print(f"\nOriginal shape: {X.shape}")
print(f"Shape after imputation with indicator: {X_imputed_indicated.shape}")
The output X_imputed_indicated will have the imputed values in the first three columns (same as X_imputed in the previous example) and three additional binary columns. The fourth column will have a 1 in row 3 (indicating X[3, 0] was missing). The fifth column will have a 1 in row 1 (indicating X[1, 1] was missing). The sixth column will have a 1 in row 0 (indicating X[0, 2] was missing).
While imputation is a standard technique, it's not without drawbacks. It makes assumptions about the data (e.g., that the median is a reasonable substitute) and can potentially distort relationships between variables or reduce variance. The choice of strategy (mean, median, most_frequent, constant) should be guided by the nature of your data and the algorithm you intend to use. For instance, median is often preferred over mean for numerical data with outliers. most_frequent or constant (with a meaningful fill_value) are necessary for categorical features if using SimpleImputer.
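A short sketch of those last two strategies on a categorical column (the colors array is made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with a missing entry (object dtype, np.nan marker)
colors = np.array([['red'], ['blue'], [np.nan], ['blue']], dtype=object)

# most_frequent fills with the mode of the column ('blue' here)
mode_imputer = SimpleImputer(strategy='most_frequent')
print(mode_imputer.fit_transform(colors))

# constant fills with an explicit placeholder string
const_imputer = SimpleImputer(strategy='constant', fill_value='Missing')
print(const_imputer.fit_transform(colors))
```

Note that mean and median would raise an error on this array, since a mean of strings is undefined.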
In practice, you often need to apply different imputation strategies to different columns (e.g., median for numerical, most frequent for categorical). This is typically handled using Scikit-learn's ColumnTransformer, which allows applying different transformers to different subsets of columns, often within a larger Pipeline. We will cover ColumnTransformer and Pipelines in detail in Chapter 6. For now, understanding how SimpleImputer works provides a solid foundation for handling missing data.
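As a brief preview of that pattern, here is a minimal sketch of per-column imputation with ColumnTransformer (the mixed-type array and transformer names are illustrative, not a definitive recipe):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical mixed data: column 0 is numerical, column 1 is categorical
X = np.array([[1.0, 'red'],
              [np.nan, 'blue'],
              [3.0, 'blue'],
              [4.0, np.nan]], dtype=object)

# Apply a different imputation strategy to each subset of columns
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), [0]),        # median for numerical
    ('cat', SimpleImputer(strategy='most_frequent'), [1]), # mode for categorical
])

# Column 0: NaN -> 3.0 (median of [1, 3, 4]); column 1: NaN -> 'blue'
print(preprocessor.fit_transform(X))
```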