Real-world datasets are often incomplete. You might find gaps where data should be, represented perhaps as NaN (Not a Number), null, an empty cell, or sometimes a specific placeholder value like 999 or -1. These missing values can cause problems for many analysis tools and statistical methods, which often expect complete data. Ignoring them can lead to errors or, worse, biased and inaccurate results. Therefore, addressing missing data is a fundamental step in data preparation.
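Before anything can be fixed, the missingness has to be found. A minimal sketch with pandas, using a small hypothetical dataset where -1 and 999 are assumed to be placeholder codes for "missing":

```python
import numpy as np
import pandas as pd

# Hypothetical data: -1 and 999 act as placeholder codes for "missing",
# alongside a genuine NaN in each column.
df = pd.DataFrame({
    "age": [25, -1, 30, np.nan, 42],
    "income": [50000, 62000, 999, 58000, np.nan],
})

# Convert the placeholder codes to NaN so all missingness is represented
# uniformly and is visible to pandas' missing-data tools.
df = df.replace({"age": {-1: np.nan}, "income": {999: np.nan}})

# Count missing values per column
print(df.isna().sum())
```

Normalizing placeholders to NaN first matters: `isna()` cannot see a 999 that silently stands in for a missing income.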
Data can be missing for various reasons, such as equipment or sensor failures, survey respondents skipping questions, data entry errors, or records merged from sources that did not collect the same fields. Understanding why data might be missing can sometimes inform the best strategy for handling it, though often the exact reason is unknown.
There are two primary approaches to dealing with missing values: Deletion and Imputation.
Deletion involves removing the data points (rows) or features (columns) that contain missing values.
Listwise Deletion (Row Removal): The most straightforward method is to remove any row that contains at least one missing value.
Column (Feature) Removal: If a specific column has a very high percentage of missing values (e.g., more than 50-60%), it might provide little useful information. In such cases, you might decide to remove the entire column.
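Both deletion strategies are one-liners in pandas. A sketch on a small hypothetical dataset (the column names and the 50% threshold are illustrative choices, not fixed rules):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 30, 35],
    "city": ["NY", "LA", None, "SF"],
    "notes": [np.nan, np.nan, np.nan, "ok"],  # 75% missing
})

# Listwise deletion: drop any row with at least one missing value
rows_kept = df.dropna()

# Column removal: drop columns whose fraction of missing values
# exceeds a chosen threshold (here 50%)
threshold = 0.5
cols_kept = df.loc[:, df.isna().mean() <= threshold]
```

Note how aggressive listwise deletion can be: only one of the four rows survives here, because each of the others has a gap somewhere.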
Imputation involves replacing missing values with substitute values. The goal is to estimate a reasonable replacement based on the available information.
Mean/Median/Mode Imputation: These are simple statistical imputation methods.
Mean Imputation: Replace missing numerical values with the average (mean) of the non-missing values in that column. Best suited for numerical data that is roughly symmetrically distributed (without extreme outliers).
Median Imputation: Replace missing numerical values with the middle value (median) of the non-missing values in that column. More robust to outliers than the mean, making it a better choice for skewed numerical data.
Mode Imputation: Replace missing categorical (or sometimes discrete numerical) values with the most frequent value (mode) in that column. This is the standard approach for non-numerical data.
Pros: Simple to implement. Retains the full dataset size (no row deletion).
Cons: Reduces the variance (spread) of the data in the imputed column. Distorts relationships (like correlation) between variables because it assumes the imputed value is independent of other features for that observation. Does not account for uncertainty.
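The three simple imputation strategies above can be sketched with pandas `fillna`. The dataset is hypothetical, with one column chosen to illustrate each case:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # numeric, roughly symmetric -> mean
    "age": [25, 30, np.nan, 35, 28, np.nan, 40],
    # numeric, heavily skewed by one outlier -> median
    "income": [30000, 32000, np.nan, 31000, 500000, 29000, np.nan],
    # categorical -> mode
    "color": ["red", "blue", None, "red", "red", "blue", None],
})

# Mean imputation for roughly symmetric numeric data
df["age"] = df["age"].fillna(df["age"].mean())

# Median imputation is robust to the 500000 outlier
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation for categorical data (mode() can return ties,
# so take the first entry)
df["color"] = df["color"].fillna(df["color"].mode()[0])
```

Compare the two numeric columns: the income mean would be dragged far upward by the 500000 outlier, while the median (31000) stays representative of typical values.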
Let's consider a small example. Imagine a dataset with ages: [25, 30, ?, 35, 28, ?, 40]. The mean of the five non-missing values is (25 + 30 + 35 + 28 + 40) / 5 = 31.6, so mean imputation would replace each ? with 31.6. To find the median, sort the non-missing values: [25, 28, 30, 35, 40]. The median is 30. Median imputation would replace ? with 30.
Other Imputation Techniques (Brief Mention): More advanced techniques exist, such as filling missing values based on other features (e.g., using regression) or finding similar data points (e.g., k-Nearest Neighbors imputation). These methods often provide better estimates but are more complex and beyond the scope of this introductory section.
Before deciding on a strategy, it's often helpful to visualize the extent and pattern of missingness. A simple bar chart showing the percentage of missing values per column is a common starting point.
This chart shows that 'Last Purchase Date' has a high percentage of missing values, while 'City' has none. 'Income' also has a noticeable amount missing. This visualization helps prioritize which columns need attention.
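Computing the numbers behind such a chart takes one line with pandas. A sketch on hypothetical customer data (column names follow the chart described above; the values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["NY", "LA", "SF", "CHI"],
    "Income": [50000, np.nan, 62000, np.nan],
    "Last Purchase Date": [np.nan, np.nan, np.nan, "2024-01-05"],
})

# Percentage of missing values per column, largest first
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Rendering the bar chart itself is one more line with pandas' plotting
# interface (requires matplotlib):
# missing_pct.plot(kind="bar", ylabel="% missing")
```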
There is no single best way to handle missing values. The choice depends on factors such as how much data is missing, why it is missing, whether each affected feature is numerical or categorical, and the requirements of your downstream analysis.
It's good practice to document how you handled missing values, as this decision can influence the final results of your analysis. Start simple, often with median imputation for numerical features and mode imputation for categorical features, and be mindful of the potential drawbacks.
© 2025 ApX Machine Learning