Real-world datasets are often incomplete. Entries might be missing because data wasn't collected, was lost during processing, or is simply not applicable for certain observations. In Pandas, these missing values are typically represented by the special floating-point value NaN (Not a Number). Ignoring missing data can lead to incorrect calculations, biased analyses, and poorly performing machine learning models. Identifying and appropriately handling missing values is therefore a fundamental step in data preparation.
Pandas provides straightforward methods to detect missing values. The isnull() method (and its alias isna()) returns a boolean DataFrame or Series of the same shape as the original, indicating True where data is missing (NaN) and False otherwise. Conversely, notnull() returns True for non-missing values.
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {'col1': [1, 2, np.nan, 4, 5],
        'col2': [np.nan, 7, 8, 9, 10],
        'col3': [11, 12, 13, np.nan, np.nan],
        'col4': ['A', 'B', 'C', 'D', np.nan]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\nCheck for missing values (isnull()):")
print(df.isnull())
print("\nCheck for non-missing values (notnull()):")
print(df.notnull())
While inspecting the boolean mask is useful for small datasets, it's often more practical to get a count of missing values per column. You can achieve this by chaining the sum() method after isnull(), since True values are treated as 1 and False as 0 during summation.
# Count missing values per column
print("\nMissing values count per column:")
print(df.isnull().sum())
# Total number of missing values in the DataFrame
print("\nTotal missing values in DataFrame:")
print(df.isnull().sum().sum())
This summary quickly tells you which columns have missing data and how much.
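Beyond raw counts, the fraction of missing values per column is often more informative, and it follows the same idea: chaining mean() over the boolean mask averages the True values. A minimal sketch:
# Fraction of missing values per column (mean of the boolean mask)
print("\nFraction of missing values per column:")
print(df.isnull().mean())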
There are two primary strategies for dealing with NaN values: removing them or replacing them (imputation). The choice depends on the context, the amount of missing data, and the potential impact on your analysis or model.
The simplest approach is to remove rows or columns containing NaN values using the dropna() method.
Dropping Rows: By default, dropna() removes any row containing at least one NaN value. This is often suitable when only a small fraction of rows have missing data and removing them doesn't introduce significant bias.
# Drop rows with any NaN values (default behavior)
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with any NaN:")
print(df_dropped_rows)
Dropping Columns: You can drop entire columns that contain missing values by setting the axis parameter to 1 (or 'columns'). This might be useful if a column has a very high proportion of missing values or is deemed unimportant for the analysis.
# Drop columns with any NaN values
df_dropped_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with any NaN:")
print(df_dropped_cols)
Controlling Drop Behavior: The how parameter modifies this behavior:
- how='any': Drop the row/column if any NaN values are present (the default).
- how='all': Drop the row/column only if all of its values are NaN.
The thresh parameter provides finer control: it specifies the minimum number of non-missing values required for a row/column to be kept.
# Keep rows with at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print("\nDataFrame keeping rows with at least 3 non-NaN values:")
print(df_thresh)
# Drop rows where all values are NaN (useful, though not applicable here)
df_all_nan = df.dropna(how='all')
# print(df_all_nan) # Output would be same as original df in this case
Caution: Dropping data, especially rows, leads to information loss. If missing values are not randomly distributed, dropping them can bias your remaining dataset. Always consider the implications before removing data.
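Before dropping, it can be worth quantifying how much data you would lose. A quick sketch using the sample df from above:
# Compare row counts before and after dropping
rows_before = len(df)
rows_after = len(df.dropna())
print(f"\nDropping rows with any NaN removes {rows_before - rows_after} of {rows_before} rows")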
Instead of removing data, you can replace NaN values with substitutes, a process known as imputation. The fillna() method is used for this purpose.
Filling with a Constant Value: You can replace all NaNs with a specific value, such as 0, "Unknown", or another value that makes sense in the context of the feature.
# Fill all NaN with 0
df_filled_zero = df.fillna(0)
print("\nDataFrame after filling NaN with 0:")
print(df_filled_zero)
# Fill NaN in specific columns differently
fill_values = {'col1': df['col1'].mean(), 'col2': 0, 'col3': -1, 'col4': 'Unknown'}
df_filled_specific = df.fillna(value=fill_values)
print("\nDataFrame after filling NaN with specific values per column:")
print(df_filled_specific)
Forward Fill and Backward Fill: For ordered data such as time series, it can be appropriate to propagate the last valid observation forward with ffill() or the next valid observation backward with bfill(). The older fillna(method='ffill') form is deprecated in recent versions of Pandas, so the dedicated methods are preferred.
# Forward fill: propagate the last valid observation downward
# (a NaN in the first row has no prior value and remains NaN)
df_ffill = df.ffill()
print("\nDataFrame after forward fill:")
print(df_ffill)
# Backward fill: propagate the next valid observation upward
df_bfill = df.bfill()
print("\nDataFrame after backward fill:")
print(df_bfill)
Filling with Statistical Measures: A common technique for numerical data is to fill NaNs with the column's mean, median, or mode. The median is often preferred over the mean when the data has outliers, as it is less sensitive to extreme values. The mode can be used for categorical features.
# Fill NaN in col1 with the mean of col1
mean_col1 = df['col1'].mean()
df_filled_mean = df.copy()  # work on a copy so the original df stays unchanged
df_filled_mean['col1'] = df_filled_mean['col1'].fillna(mean_col1)
print("\nDataFrame after filling NaN in col1 with its mean:")
print(df_filled_mean)
# Fill NaN in col3 with the median of col3
median_col3 = df['col3'].median()
df_filled_median = df.copy()
df_filled_median['col3'] = df_filled_median['col3'].fillna(median_col3)
print("\nDataFrame after filling NaN in col3 with its median:")
print(df_filled_median)
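For categorical data such as col4, the mode (most frequent value) mentioned above is the usual statistical fill. Note that mode() returns a Series because ties are possible, so we take its first entry:
# Fill NaN in col4 (categorical) with the most frequent value
mode_col4 = df['col4'].mode()[0]
df_filled_mode = df.copy()
df_filled_mode['col4'] = df_filled_mode['col4'].fillna(mode_col4)
print("\nDataFrame after filling NaN in col4 with its mode:")
print(df_filled_mode)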
Important Note: When preparing data for machine learning, you should calculate statistical measures (like mean or median) only on the training dataset and then use those values to fill missing data in both the training and testing datasets. This prevents information leakage from the test set into the training process. We will cover data splitting in Chapter 5.
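As a brief illustration of that principle, here is a minimal sketch using a hypothetical split of df into training and test portions (proper splitting is covered in Chapter 5):
# Hypothetical split: first three rows as "train", the rest as "test"
train_df = df.iloc[:3].copy()
test_df = df.iloc[3:].copy()
# Compute the statistic on the training data only...
train_mean_col1 = train_df['col1'].mean()
# ...then apply it to both sets, so no test-set information leaks in
train_df['col1'] = train_df['col1'].fillna(train_mean_col1)
test_df['col1'] = test_df['col1'].fillna(train_mean_col1)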
Group-Specific Imputation: Sometimes a global mean or median isn't representative. For instance, when filling missing salaries, the average salary might differ significantly by job role. You can use groupby() combined with transform() to fill missing values based on group-level statistics.
# Example: Fill NaN based on group means
data_groups = {'Group': ['A', 'A', 'B', 'B', 'A', 'B'],
               'Value': [10, np.nan, 20, 22, 12, np.nan]}
df_groups = pd.DataFrame(data_groups)
# Calculate group means and fill NaN within each group
df_groups['Value_filled'] = (df_groups.groupby('Group')['Value']
                             .transform(lambda x: x.fillna(x.mean())))
print("\nDataFrame with group-specific mean imputation:")
print(df_groups)
transform() applies a function (here, filling NaN with the group mean) to each group and returns a Series aligned with the original DataFrame's index.
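One edge case to be aware of: if every value in a group is NaN, the group mean is itself NaN and those entries remain unfilled, so a global fallback can be applied afterwards. A minimal sketch:
# Fall back to the overall mean for any group whose values were all NaN
df_groups['Value_filled'] = df_groups['Value_filled'].fillna(df_groups['Value'].mean())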
The best method for handling missing data depends on the context: how much data is missing, why it is missing, and the potential impact of each strategy on your subsequent analysis or model.
Handling missing data is often an iterative process. You might try different strategies and evaluate their impact on subsequent analysis or model performance. Pandas provides the flexible tools needed to implement these various approaches effectively.