Before we can address missing data, we first need to find out where it exists in our dataset and how much of it there is. Just knowing that data is missing isn't enough. We need systematic ways to pinpoint the exact locations and quantify the extent of the problem. This usually involves using programming tools to inspect our data structure, typically a table or DataFrame.
Let's assume we are working with data loaded into a pandas DataFrame, a common structure used in Python for data analysis. Pandas provides convenient functions specifically designed for detecting missing values, which it usually recognizes as NaN (Not a Number) or None.
The most direct way to check for missing entries is the isnull() method (or its equivalent alias, isna()). When applied to a pandas DataFrame or Series (a single column), it doesn't change the data itself. Instead, it returns a new object of the same shape, filled with boolean values: True indicates that the original value at that position was missing, and False indicates it was present.
import pandas as pd
import numpy as np
# Sample data with missing values
data = {'Product ID': [101, 102, 103, 104, 105],
'Category': ['Electronics', 'Apparel', np.nan, 'Electronics', 'Home Goods'],
'Price': [1200, 55, 250, np.nan, 80],
'Rating': [4.5, np.nan, 3.8, 4.1, np.nan]}
df = pd.DataFrame(data)
# Detect missing values
missing_mask = df.isnull()
print(missing_mask)
# Output:
# Product ID Category Price Rating
# 0 False False False False
# 1 False False False True
# 2 False True False False
# 3 False False True False
# 4 False False False True
This boolean mask is useful for seeing the exact location of every missing value, but for larger datasets, scanning a wall of True/False values isn't a practical way to summarize the problem.
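Before moving on to summaries, note that the mask does have one common direct use: boolean indexing. A minimal sketch, reusing the df defined above, selects only the rows that contain at least one missing value (the exact alignment of the printed output may vary):
# Select rows that contain at least one missing value
# any(axis=1) is True for a row if any column in the mask is True
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
# Output:
#    Product ID     Category  Price  Rating
# 1         102      Apparel   55.0     NaN
# 2         103          NaN  250.0     3.8
# 3         104  Electronics    NaN     4.1
# 4         105   Home Goods   80.0     NaN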
A more common task is to count how many missing values exist within each column. We can achieve this by chaining the .sum() method after .isnull(). When you sum boolean values, True is treated as 1 and False as 0, so applying .sum() to the boolean DataFrame produced by isnull() counts the number of True values (missing entries) in each column.
# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)
# Output:
# Product ID 0
# Category 1
# Price 1
# Rating 2
# dtype: int64
This output is a pandas Series where the index represents the column names and the values represent the count of missing entries in those columns. This immediately tells us that 'Product ID' has no missing values, 'Category' and 'Price' each have one, and 'Rating' has two.
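On wide datasets with many columns, it is often convenient to narrow this Series down to just the columns that actually contain missing values, sorted from worst to best. This is ordinary Series filtering, sketched here on the same counts:
# Keep only columns with at least one missing value, largest first
# (the relative order of Category and Price, which tie at 1, may vary)
cols_with_missing = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(cols_with_missing)
# Output:
# Rating      2
# Price       1
# Category    1
# dtype: int64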
To get a single number representing the total count of missing values across the entire DataFrame, you can apply .sum() again:
# Total number of missing values in the entire DataFrame
total_missing = df.isnull().sum().sum()
print(f"Total missing values in the dataset: {total_missing}")
# Output:
# Total missing values in the dataset: 4
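A raw total is easier to interpret alongside the overall size of the DataFrame. One way to add that context is df.size, which gives the total number of cells (rows times columns):
# Express the total as a percentage of all cells in the DataFrame
# df.size is the total number of cells; here 5 rows * 4 columns = 20
overall_pct = total_missing / df.size * 100
print(f"Percentage of all cells missing: {overall_pct:.1f}%")
# Output:
# Percentage of all cells missing: 20.0%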
Counts are helpful, but understanding the proportion of missing data relative to the total size of each column often gives better context. A column missing 10 values is less concerning in a dataset with 10,000 rows than in one with only 50 rows.
We can calculate the percentage of missing values per column like this:
$$\text{Percentage Missing} = \frac{\text{Number of Missing Values}}{\text{Total Number of Rows}} \times 100$$

In pandas, this translates to:
# Calculate the percentage of missing values per column
total_rows = len(df)
missing_percentage = (df.isnull().sum() / total_rows) * 100
print(missing_percentage)
# Output:
# Product ID 0.0
# Category 20.0
# Price 20.0
# Rating 40.0
# dtype: float64
This shows that 20% of the 'Category' and 'Price' entries are missing, and a significant 40% of the 'Rating' entries are missing. Percentages help prioritize which columns might need more attention during the cleaning process.
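As a side note, because the mean of boolean values is the fraction of True entries, the same percentages can be computed more compactly with .mean(). This is simply an equivalent idiom, not a different calculation:
# Equivalent: the mean of a boolean column is its fraction of True values
missing_percentage = df.isnull().mean() * 100
print(missing_percentage)
# Output is identical to the division-based version above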
While numerical summaries are essential, sometimes a visual representation can make the distribution of missing data across columns more immediately apparent. A simple bar chart showing the count of missing values per column is often effective.
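One way to produce such a chart is sketched below using matplotlib (an assumption about tooling; any plotting library would work):
import matplotlib.pyplot as plt

# Plot the per-column missing-value counts as a bar chart
missing_counts = df.isnull().sum()
missing_counts.plot(kind='bar')
plt.title('Missing Values per Column')
plt.xlabel('Column')
plt.ylabel('Count of Missing Values')
plt.tight_layout()
plt.show()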
This bar chart visualizes the counts derived from df.isnull().sum(), making it easy to compare the extent of missing data across different features.
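Finally, counts and percentages are often reported side by side. The helper below is a hypothetical convenience function (not a pandas built-in) that assembles both into a single summary DataFrame:
def missing_summary(frame):
    """Return a DataFrame of missing-value counts and percentages per column."""
    counts = frame.isnull().sum()
    return pd.DataFrame({
        'missing_count': counts,
        'missing_pct': counts / len(frame) * 100,
    })

print(missing_summary(df))
# Output:
#             missing_count  missing_pct
# Product ID              0          0.0
# Category                1         20.0
# Price                   1         20.0
# Rating                  2         40.0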
These programmatic methods, combining boolean checks, counts, and percentages, form the foundation for understanding the scope of missing data in your dataset. Once you've identified where and how much data is missing, you can move on to choosing and applying appropriate strategies to handle it, which we will cover in the subsequent sections.