Before you start filling in missing values, it's helpful to consider why they might be missing in the first place. Understanding the underlying mechanism can guide your choice of imputation strategy and help you anticipate potential biases introduced during data preparation. Statisticians Donald Rubin and Roderick Little classified missing data mechanisms into three main categories: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).
Let's examine each one.
Missing Completely At Random (MCAR) is the simplest scenario. Data is MCAR if the probability that a value is missing is independent of both the observed values and the missing values themselves. In other words, the missingness is purely random and has no systematic cause related to the data.
Think of it like this: imagine a survey where a few participants accidentally skipped a question because two pages stuck together, or a lab study where a sample was dropped and lost purely by chance. The reason for the missing data point has nothing to do with the participant's other answers or with the value that was lost.
Characteristics:
Implications:
Formal Definition (Conceptual): The probability of an observation being missing ($M=1$) is independent of both the observed data ($Y_{\text{obs}}$) and the potentially missing data ($Y_{\text{miss}}$): $P(M=1 \mid Y_{\text{obs}}, Y_{\text{miss}}) = P(M=1)$
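To make the MCAR definition concrete, here is a minimal simulation sketch in plain Python. The income distribution, the 20% missing rate, and all variable names are assumptions chosen for illustration; they do not come from any real dataset. Because every value is dropped with the same probability, the mean of the observed values should stay close to the mean of the full data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical complete data: 10,000 synthetic incomes.
incomes = [random.gauss(50_000, 10_000) for _ in range(10_000)]

# MCAR: every value has the same 20% chance of being dropped,
# regardless of its own size or of any other variable.
observed = [x for x in incomes if random.random() >= 0.2]

full_mean = statistics.mean(incomes)
observed_mean = statistics.mean(observed)
missing_rate = 1 - len(observed) / len(incomes)

print(f"missing rate:  {missing_rate:.3f}")
print(f"full mean:     {full_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
```

Under these assumptions, losing values costs sample size but does not systematically shift the estimate.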
Missing At Random (MAR) is a more common and slightly more complex situation. Data is MAR if the probability that a value is missing depends only on other observed variables in the dataset, and not on the value of the missing variable itself once those observed variables are controlled for.
The name "Missing At Random" can be a bit misleading. It doesn't mean the missingness is purely random like MCAR. Instead, it means that given the information you have in other columns, the missingness is random.
Example: Consider a dataset with 'Income' and 'Years of Education'. Suppose men are less likely to report their income than women. If gender were not recorded, the missingness would depend on an unobserved factor, and the data would be MNAR. However, if we also have a fully observed 'Gender' column, and the probability of missing income depends only on 'Gender' and not on the actual income level, then the data is MAR. The missingness in 'Income' can be predicted from the observed 'Gender' variable.
Another example: In a health study, patients with higher reported stress levels (observed) might be less likely to complete a follow-up blood pressure measurement (missing), but the missingness isn't directly related to the unmeasured blood pressure value itself, only to the observed stress level.
Characteristics:
Implications:
Formal Definition (Conceptual): The probability of an observation being missing depends only on the observed data ($Y_{\text{obs}}$), not on the missing data ($Y_{\text{miss}}$): $P(M=1 \mid Y_{\text{obs}}, Y_{\text{miss}}) = P(M=1 \mid Y_{\text{obs}})$
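The income and gender example can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: the group means, the missing rates of 40% and 10%, and the helper `mean_income` are hypothetical. In particular, the sketch assumes the two groups have different average incomes, which is what makes the naive estimate drift.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical records: (gender, income). Gender is always observed.
# Assumption: the two groups have different mean incomes.
rows = []
for _ in range(20_000):
    gender = random.choice(["M", "F"])
    group_mean = 60_000 if gender == "M" else 50_000
    rows.append((gender, random.gauss(group_mean, 8_000)))

# MAR: the chance that income is missing depends only on observed gender,
# never on the income value itself.
p_missing = {"M": 0.4, "F": 0.1}
observed = [(g, inc) for g, inc in rows if random.random() >= p_missing[g]]

def mean_income(records, gender=None):
    """Mean income over all records, or over one gender if given."""
    return statistics.mean(
        inc for g, inc in records if gender is None or g == gender
    )

full_mean = mean_income(rows)
naive_mean = mean_income(observed)  # ignores why values are missing

# Gender is fully observed, so each group's true share is known and the
# observed group means can be reweighted by it.
shares = {g: sum(1 for gg, _ in rows if gg == g) / len(rows) for g in p_missing}
adjusted_mean = sum(shares[g] * mean_income(observed, g) for g in p_missing)

print(f"full mean:     {full_mean:,.0f}")
print(f"naive mean:    {naive_mean:,.0f}")
print(f"adjusted mean: {adjusted_mean:,.0f}")
```

In this setup the naive mean under-represents the group that reports less often, while conditioning on the observed variable recovers an estimate close to the full-data mean. That is the practical sense in which MAR missingness is "random given what you observe".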
Missing Not At Random (MNAR) is the most challenging scenario. Data is MNAR if the probability that a value is missing depends on the missing value itself, or on other unobserved factors. In other words, the reason for the missingness is tied to information you do not have.
Examples:
Characteristics:
Implications:
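Formal Definition (Conceptual): The probability of an observation being missing depends on the missing value ($Y_{\text{miss}}$), even after accounting for the observed data ($Y_{\text{obs}}$): $P(M=1 \mid Y_{\text{obs}}, Y_{\text{miss}})$ depends on $Y_{\text{miss}}$, so $P(M=1 \mid Y_{\text{obs}}, Y_{\text{miss}}) \neq P(M=1 \mid Y_{\text{obs}})$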
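A short simulation can illustrate why MNAR is hard to handle. The sketch below assumes a hypothetical logistic rule, `p_missing`, under which higher incomes are more likely to be withheld; the rule and its parameters are invented for illustration only. Since the cause of missingness is the value itself, no other column is available to correct for it.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical complete data: 10,000 synthetic incomes.
incomes = [random.gauss(50_000, 10_000) for _ in range(10_000)]

def p_missing(income):
    """Hypothetical rule: the higher the income, the likelier it is withheld."""
    return 1 / (1 + math.exp(-(income - 50_000) / 5_000))

# MNAR: the probability of being missing depends on the value that goes missing.
observed = [x for x in incomes if random.random() >= p_missing(x)]

full_mean = statistics.mean(incomes)
observed_mean = statistics.mean(observed)
missing_rate = 1 - len(observed) / len(incomes)

print(f"missing rate:  {missing_rate:.3f}")
print(f"full mean:     {full_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
```

Under this assumed rule the observed mean sits well below the full-data mean, and nothing in the observed values alone reveals by how much.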
Dependency diagrams for MCAR, MAR, and MNAR. Arrows indicate that the probability of missingness depends on the source variable.
While you can rarely be certain about the true missing data mechanism without specific knowledge of the data collection process, thinking about these categories is important because the mechanism guides your choice of imputation strategy and the biases you should anticipate.
In practice, MAR is often a reasonable working assumption when MCAR seems too simplistic and there's no strong domain reason to suspect MNAR. This assumption justifies using imputation methods that leverage relationships between variables. In the following sections, we will explore specific techniques for imputing missing values, keeping these underlying mechanisms in mind.
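One simple exploratory check, sketched here under stated assumptions, is to compare the missing rate of a column across the groups of a fully observed variable. The records, the field names `gender` and `income`, and the helper `missing_rate_by` are all hypothetical. Clearly different rates are evidence against MCAR; similar rates do not prove MCAR, and no check on observed data alone can separate MAR from MNAR.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical records with an observed "gender" and an "income" that may be None.
records = []
for _ in range(5_000):
    gender = random.choice(["M", "F"])
    income = random.gauss(50_000, 10_000)
    if random.random() < (0.4 if gender == "M" else 0.1):
        income = None  # simulate a non-response
    records.append({"gender": gender, "income": income})

def missing_rate_by(records, key, target):
    """Fraction of records with a missing `target`, per value of `key`."""
    counts = {}
    for r in records:
        total, missing = counts.get(r[key], (0, 0))
        counts[r[key]] = (total + 1, missing + (r[target] is None))
    return {k: missing / total for k, (total, missing) in counts.items()}

rates = missing_rate_by(records, "gender", "income")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.3f} of incomes missing")
```

A gap like the one simulated here would suggest conditioning on that variable when imputing, in line with the MAR working assumption.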
© 2025 ApX Machine Learning