As highlighted in the chapter introduction, real-world datasets are rarely complete. They often contain gaps, represented as missing values. Before you can analyze data or train a model, you need to know where these gaps are and how extensive they might be. Pandas provides straightforward tools to detect missing data, which is conventionally marked using the special floating-point value NaN (Not a Number). Python's None object is also treated as missing data in Pandas objects.
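As a minimal sketch of how None and NaN are both treated as missing, the snippet below builds two small Series. The variable names numeric and labels are illustrative only and do not appear elsewhere in this section.

```python
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numeric = pd.Series([1, None, 3])

# In an object-dtype Series, None still counts as a missing value.
labels = pd.Series(["a", None, "c"], dtype=object)

print(numeric.dtype)               # float64
print(numeric.isnull().tolist())   # [False, True, False]
print(labels.isnull().tolist())    # [False, True, False]
```

In both cases the second entry is reported as missing, regardless of whether it was written as None or ends up stored as NaN.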
isnull() and notnull()
Pandas offers two primary methods for detecting missing values:
isnull(): Returns a boolean object (Series or DataFrame) of the same size as the original, where True indicates a missing value (NaN or None) and False indicates a non-missing value.
notnull(): The inverse of isnull(). It returns True for non-missing values and False for missing values.
Let's see these in action. First, we'll need pandas and numpy imported.
import pandas as pd
import numpy as np
Now, let's create a simple Pandas Series containing some missing data represented by np.nan:
# Create a Series with missing values
data_series = pd.Series([1, np.nan, 3.5, np.nan, 7])
print("Original Series:")
print(data_series)
Original Series:
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
Now, we can use isnull() to create a boolean mask identifying the locations of the NaN values:
# Detect missing values
missing_mask = data_series.isnull()
print("\nBoolean mask from isnull():")
print(missing_mask)
Boolean mask from isnull():
0    False
1     True
2    False
3     True
4    False
dtype: bool
As you can see, the resulting Series contains True at indices 1 and 3, corresponding to the NaN values in the original data_series.
Conversely, notnull() identifies the non-missing values:
# Detect non-missing values
not_missing_mask = data_series.notnull()
print("\nBoolean mask from notnull():")
print(not_missing_mask)
Boolean mask from notnull():
0     True
1    False
2     True
3    False
4     True
dtype: bool
This returns True where the data exists and False where it's missing.
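One common use of these masks is boolean indexing. The sketch below reuses the same values as data_series; the names present and missing_positions are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

data_series = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Keep only the entries where notnull() is True.
present = data_series[data_series.notnull()]

# Index labels of the missing entries, taken from the isnull() mask.
missing_positions = data_series.index[data_series.isnull()].tolist()

print(present.tolist())      # [1.0, 3.5, 7.0]
print(missing_positions)     # [1, 3]
```

The original Series is left unchanged; indexing with a mask returns a new object.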
(Note: You might also encounter isna() and notna(). They behave identically to isnull() and notnull(); in the pandas API, isnull() is documented as an alias of isna(), and notnull() as an alias of notna().)
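A quick way to convince yourself that the two spellings agree is to compare their results directly. This is only a sanity check on the example data, not something you would need in real code; same_missing and same_present are illustrative names.

```python
import numpy as np
import pandas as pd

data_series = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Compare the masks produced by each pair of method names.
same_missing = data_series.isna().equals(data_series.isnull())
same_present = data_series.notna().equals(data_series.notnull())

print(same_missing, same_present)   # True True

# The top-level function pd.isna() also accepts scalars.
print(pd.isna(np.nan), pd.isna(None), pd.isna(3.5))   # True True False
```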
These methods work similarly on DataFrames, but they return a boolean DataFrame instead of a Series.
Let's create a DataFrame with missing values:
# Create a DataFrame with missing values
data = {'col_a': [1, 2, np.nan, 4, 5],
        'col_b': [np.nan, 7, 8, np.nan, 10],
        'col_c': [11, 12, 13, 14, 15],
        'col_d': ['apple', 'banana', 'orange', np.nan, 'grape']}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Original DataFrame:
   col_a  col_b  col_c   col_d
0    1.0    NaN     11   apple
1    2.0    7.0     12  banana
2    NaN    8.0     13  orange
3    4.0    NaN     14     NaN
4    5.0   10.0     15   grape
Applying isnull() to this DataFrame gives us:
# Detect missing values in the DataFrame
missing_df_mask = df.isnull()
print("\nBoolean mask DataFrame from isnull():")
print(missing_df_mask)
Boolean mask DataFrame from isnull():
   col_a  col_b  col_c  col_d
0  False   True  False  False
1  False  False  False  False
2   True  False  False  False
3  False   True  False   True
4  False  False  False  False
This boolean DataFrame directly maps the locations of missing values within the original df.
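The boolean DataFrame can also be reduced with any() to find which rows or columns contain at least one gap. The following is a sketch on the same example data; rows_with_missing and columns_with_missing are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, np.nan, 4, 5],
    'col_b': [np.nan, 7, 8, np.nan, 10],
    'col_c': [11, 12, 13, 14, 15],
    'col_d': ['apple', 'banana', 'orange', np.nan, 'grape'],
})

# any(axis=1) is True for each row that has at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

# any() with the default axis checks each column instead.
columns_with_missing = df.columns[df.isnull().any()].tolist()

print(rows_with_missing.index.tolist())   # [0, 2, 3]
print(columns_with_missing)               # ['col_a', 'col_b', 'col_d']
```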
While seeing the exact location of missing values is useful, you often need a summary. How many missing values are there in total, or per column? You can easily achieve this by summing the results of isnull(), because in numerical contexts, True is treated as 1 and False as 0.
To count missing values in each column:
# Count missing values per column
missing_counts_per_column = df.isnull().sum()
print("\nMissing value counts per column:")
print(missing_counts_per_column)
Missing value counts per column:
col_a    1
col_b    2
col_c    0
col_d    1
dtype: int64
This is a very common operation. It quickly tells you that col_a has one missing value, col_b has two, col_c has none, and col_d has one.
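Because True counts as 1 and False as 0, taking the mean of the mask instead of the sum gives the fraction of missing values per column, which can be easier to interpret on larger datasets. This is a minimal sketch on the same example data; missing_fraction is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, np.nan, 4, 5],
    'col_b': [np.nan, 7, 8, np.nan, 10],
    'col_c': [11, 12, 13, 14, 15],
    'col_d': ['apple', 'banana', 'orange', np.nan, 'grape'],
})

# The mean of a boolean column is the share of True values in it.
missing_fraction = df.isnull().mean()

print(missing_fraction.round(2).to_dict())
# {'col_a': 0.2, 'col_b': 0.4, 'col_c': 0.0, 'col_d': 0.2}
```

With five rows, one missing value corresponds to 0.2 (20%) and two to 0.4 (40%).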
To get the total number of missing values in the entire DataFrame, you can sum twice: the first sum() produces the per-column counts, and the second adds those counts together:
# Count total missing values in the DataFrame
total_missing_count = df.isnull().sum().sum()
print(f"\nTotal missing values in the DataFrame: {total_missing_count}")
Total missing values in the DataFrame: 4
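Two related variations, sketched on the same example data: summing along axis=1 counts missing values per row, and converting the mask to a NumPy array lets you get the total with a single sum. The names missing_per_row and total_missing are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, np.nan, 4, 5],
    'col_b': [np.nan, 7, 8, np.nan, 10],
    'col_c': [11, 12, 13, 14, 15],
    'col_d': ['apple', 'banana', 'orange', np.nan, 'grape'],
})

# axis=1 sums across the columns, giving one count per row.
missing_per_row = df.isnull().sum(axis=1)

# Summing the underlying boolean array gives the overall total in one step.
total_missing = int(df.isnull().to_numpy().sum())

print(missing_per_row.tolist())   # [1, 0, 1, 2, 0]
print(total_missing)              # 4
```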
Detecting where and how much data is missing is the essential first step in the data cleaning process. Once you've identified these gaps using methods like isnull() and sum(), you can move on to deciding how to handle them, which is the focus of the next sections.
© 2025 ApX Machine Learning