While deleting rows or columns with missing data is sometimes necessary, it often comes at the cost of discarding valuable information. If a column has only a few missing entries, or if a row is missing just one value out of many, removing them entirely might significantly reduce the size and richness of your dataset. An alternative approach is imputation, which involves filling in the missing values with plausible substitutes.
Basic imputation strategies rely on using summary statistics derived from the non-missing values within the same column (feature). The idea is to replace the gap with a value that represents the "center" or "most typical" value for that feature. Let's look at the three most common methods: mean, median, and mode imputation.
The mean is simply the arithmetic average of a set of numbers. You calculate it by summing all the available values in a column and dividing by the count of those values.
$$\text{Mean} = \frac{\text{Sum of all non-missing values}}{\text{Number of non-missing values}}$$

When to use it: Mean imputation is typically used for numerical columns (like height, temperature, or price) where the data doesn't have extreme outliers and follows a roughly symmetrical distribution (like a bell curve).
Example: Imagine a 'Temperature' column with values [25, 28, NaN, 30, 27].

1. Collect the non-missing values: 25, 28, 30, 27.
2. Calculate their mean: (25 + 28 + 30 + 27) / 4 = 110 / 4 = 27.5.
3. Replace the NaN with 27.5. The column becomes [25, 28, 27.5, 30, 27].

Consideration: The mean is sensitive to outliers. A single very high or very low value can significantly pull the mean in its direction, potentially making the imputed value less representative of the typical data point.
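In practice, this takes only a couple of lines with a library like pandas. Here is a minimal sketch of mean imputation, assuming the column is named 'Temperature' as in the example:

```python
import pandas as pd
import numpy as np

# Reproduce the 'Temperature' example from above.
df = pd.DataFrame({'Temperature': [25, 28, np.nan, 30, 27]})

# Series.mean() skips NaN values by default:
# (25 + 28 + 30 + 27) / 4 = 27.5
mean_temp = df['Temperature'].mean()

# Fill the missing entry with the column mean.
df['Temperature'] = df['Temperature'].fillna(mean_temp)

print(df['Temperature'].tolist())  # [25.0, 28.0, 27.5, 30.0, 27.0]
```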
The median is the middle value in a dataset when it's sorted in ascending order. If there's an even number of values, the median is the average of the two middle values.
When to use it: Median imputation is also used for numerical columns. It's often preferred over the mean when the data contains significant outliers or is skewed (meaning it has a long tail on one side). The median is less affected by extreme values.
Example: Consider an 'Income' column with values [45000, 50000, NaN, 48000, 150000]. The value 150000 is an outlier.

1. Collect the non-missing values: 45000, 50000, 48000, 150000.
2. Sort them: [45000, 48000, 50000, 150000].
3. With an even number of values, the median is the average of the two middle values (48000 and 50000): (48000 + 50000) / 2 = 49000.
4. Replace the NaN with 49000. The column becomes [45000, 50000, 49000, 48000, 150000].

Notice how the median (49000) is much closer to the bulk of the data (45000, 48000, 50000) than the mean would be. The mean would be (45000 + 50000 + 48000 + 150000) / 4 = 293000 / 4 = 73250, which is heavily influenced by the outlier.
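The same pattern works in pandas for the median. A sketch using the 'Income' example:

```python
import pandas as pd
import numpy as np

# Reproduce the 'Income' example, including the 150000 outlier.
df = pd.DataFrame({'Income': [45000, 50000, np.nan, 48000, 150000]})

# Series.median() also skips NaN and is robust to the outlier:
# the sorted non-missing values are [45000, 48000, 50000, 150000],
# so the median is (48000 + 50000) / 2 = 49000.0
median_income = df['Income'].median()

df['Income'] = df['Income'].fillna(median_income)

print(df['Income'].tolist())
# [45000.0, 50000.0, 49000.0, 48000.0, 150000.0]
```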
The mode is the value that appears most frequently in a dataset. A dataset can have one mode, multiple modes (multimodal), or no mode if all values appear with the same frequency.
When to use it: Mode imputation is the standard choice for categorical columns (like 'Color', 'Country', 'Yes/No'). You cannot calculate a meaningful mean or median for non-numeric categories. It can sometimes be used for discrete numerical data where values represent counts or specific categories (e.g., number of doors on a car).
Example: Suppose we have a 'Shirt_Color' column: ['Blue', 'Red', 'Blue', NaN, 'Green', 'Red', 'Blue'].

1. Collect the non-missing values: ['Blue', 'Red', 'Blue', 'Green', 'Red', 'Blue'].
2. Count the frequency of each value: 'Blue' appears 3 times, 'Red' appears 2 times, and 'Green' appears once, so the mode is 'Blue'.
3. Replace the NaN with 'Blue'. The column becomes ['Blue', 'Red', 'Blue', 'Blue', 'Green', 'Red', 'Blue'].
Consideration: If a categorical column has multiple modes (e.g., 'Blue' and 'Red' both appear 3 times), you might randomly choose one of the modes or use a more sophisticated imputation method. If there's no clear mode, this method might not be very informative.
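In pandas, `Series.mode()` is the natural tool here. Note that it returns a Series, because a column can have several modes; taking the first entry is one common (if arbitrary) tie-breaking convention. A sketch for the 'Shirt_Color' example:

```python
import pandas as pd
import numpy as np

# Reproduce the 'Shirt_Color' example from above.
df = pd.DataFrame({
    'Shirt_Color': ['Blue', 'Red', 'Blue', np.nan, 'Green', 'Red', 'Blue']
})

# Series.mode() returns all most-frequent values; taking the first
# entry is one simple (if arbitrary) way to break ties.
mode_color = df['Shirt_Color'].mode()[0]  # 'Blue' (appears 3 times)

df['Shirt_Color'] = df['Shirt_Color'].fillna(mode_color)

print(df['Shirt_Color'].tolist())
# ['Blue', 'Red', 'Blue', 'Blue', 'Green', 'Red', 'Blue']
```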
Here’s a simple guideline:

- Numerical data that is roughly symmetric with no extreme outliers: use the mean.
- Numerical data that is skewed or contains outliers: use the median.
- Categorical data (or discrete counts treated as categories): use the mode.
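To make the guideline concrete, here is one way it might be expressed as a helper function. This is only a sketch: the function name `impute_column` and the skewness cutoff of 1.0 are illustrative assumptions, not a standard recipe.

```python
import pandas as pd

def impute_column(series: pd.Series, skew_threshold: float = 1.0) -> pd.Series:
    """Fill missing values following the mean/median/mode guideline."""
    # Non-numeric (categorical) columns: fall back to the mode.
    if not pd.api.types.is_numeric_dtype(series):
        return series.fillna(series.mode()[0])
    # Strongly skewed numeric columns: the median is more robust.
    # (The 1.0 cutoff is an assumption for illustration.)
    if abs(series.skew()) > skew_threshold:
        return series.fillna(series.median())
    # Roughly symmetric numeric columns: the mean is a reasonable fill.
    return series.fillna(series.mean())
```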
While mean, median, and mode imputation are simple and fast, they have limitations:

- They reduce the natural variability of the feature, since every missing entry receives the same value.
- They ignore relationships between columns: the imputed value is the same regardless of what the other features in that row say.
- They can distort the feature's distribution and weaken correlations, especially when a large share of the values is missing.
These basic techniques are a starting point. More advanced methods (like regression imputation or K-Nearest Neighbors imputation) exist that try to predict missing values based on other columns, often providing more accurate results, but they are more complex and will be covered in more advanced material. For now, understanding and applying mean, median, and mode imputation provides a solid foundation for handling missing data in many common scenarios.