Having loaded your data and performed an initial inspection of its structure, dimensions, and data types, the next practical step in cleaning is to address missing data. Real-world datasets are rarely complete, and understanding where and how data is missing is fundamental before you can decide on appropriate handling strategies. Missing values can skew statistical summaries, break certain algorithms, and lead to incorrect conclusions if ignored.
Pandas primarily uses the special floating-point value NaN (Not a Number) to represent missing data. This value originates from NumPy and is standard across many data science tools in Python. It's important to recognize that NaN behaves somewhat unusually. For instance, NaN is not considered equal to itself (np.nan == np.nan evaluates to False).
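Because of this self-inequality, ordinary comparison operators cannot find missing values; a quick sketch shows why pd.isna (or the .isnull() method) is the right tool instead:

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, so == cannot detect missing values
print(np.nan == np.nan)   # False

# pd.isna() correctly recognizes both NaN and None as missing
print(pd.isna(np.nan))    # True
print(pd.isna(None))      # True
```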
Python's built-in None object can also appear in DataFrames, typically in columns with an object dtype. Pandas functions designed to detect missing data usually handle both NaN and None seamlessly. Occasionally, datasets use custom placeholders for missing values (like '?', 'missing', 999, -1). These need to be converted to NaN during or after loading for standard Pandas methods to work correctly, often using the na_values parameter in pd.read_csv or the .replace() method.
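As a brief sketch of the post-load approach (the placeholder values here are hypothetical), .replace() can map such sentinels to NaN in one call; passing the same list to na_values in pd.read_csv would achieve this at load time instead:

```python
import pandas as pd
import numpy as np

# A column where '?' and 999 stand in for missing values (hypothetical placeholders)
raw = pd.DataFrame({'Score': [85, '?', 77, 999]})

# Convert the custom placeholders to NaN so Pandas can detect them
cleaned = raw.replace(['?', 999], np.nan)
print(cleaned['Score'].isnull().sum())  # 2
```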
Pandas provides straightforward methods to detect missing values. The .isnull() method returns a boolean DataFrame of the same shape as the original, where True indicates a missing value (NaN or None) and False indicates a non-missing value.
import pandas as pd
import numpy as np

# Sample DataFrame (assuming it's loaded as 'df')
data = {'StudentID': [101, 102, 103, 104, 105],
        'Score': [85, np.nan, 77, 92, 88],
        'Grade': ['B', 'C', np.nan, 'A', 'B'],
        'Attendance': [0.95, 0.80, 0.85, np.nan, 0.92]}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())
Running df.isnull() on our sample df would produce:
StudentID Score Grade Attendance
0 False False False False
1 False True False False
2 False False True False
3 False False False True
4 False False False False
There is also an equivalent method, .isna(), which performs exactly the same function. Conversely, .notnull() (and its equivalent .notna()) returns True for non-missing values and False for missing ones.
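These boolean results are useful beyond inspection: they work as masks for selecting rows. A small sketch on the same sample df, pulling out rows where 'Score' is missing and rows with no gaps at all:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'StudentID': [101, 102, 103, 104, 105],
                   'Score': [85, np.nan, 77, 92, 88],
                   'Grade': ['B', 'C', np.nan, 'A', 'B'],
                   'Attendance': [0.95, 0.80, 0.85, np.nan, 0.92]})

# Rows where 'Score' is missing
rows_missing_score = df[df['Score'].isnull()]
print(rows_missing_score['StudentID'].tolist())  # [102]

# Rows with no missing values in any column
complete_rows = df[df.notnull().all(axis=1)]
print(len(complete_rows))  # 2
```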
While the boolean DataFrame is informative, you usually need summaries. Applying the .sum() method to the result of .isnull() aggregates the True values (each treated as 1) column-wise, giving you the total count of missing values in each column.
# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)
This would output:
StudentID 0
Score 1
Grade 1
Attendance 1
dtype: int64
This immediately tells us that the 'Score', 'Grade', and 'Attendance' columns each have one missing value, while 'StudentID' has none.
To get the total number of missing values across the entire DataFrame, you can apply .sum() again:
# Total count of missing values in the DataFrame
total_missing = df.isnull().sum().sum()
print(f"Total missing values in the DataFrame: {total_missing}")
# Output: Total missing values in the DataFrame: 3
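If you only need to know whether anything is missing at all, rather than how much, chaining .any() gives a cheap yes/no check. A sketch on a small sample:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'StudentID': [101, 102, 103],
                   'Score': [85, np.nan, 77]})

# Per column: does each column contain at least one missing value?
per_column = df.isnull().any()
print(per_column['Score'])       # True
print(per_column['StudentID'])   # False

# Whole DataFrame: is anything missing anywhere?
print(df.isnull().any().any())   # True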
Counts are useful, but understanding the proportion of missing data is often more insightful, especially in large datasets. A few missing values in a column with millions of entries might be negligible, whereas the same count in a column with only a hundred entries could be significant.
You can calculate the percentage of missing values per column by dividing the counts from .isnull().sum() by the total number of rows (obtained using len(df) or df.shape[0]) and multiplying by 100.
# Calculate the percentage of missing values per column
total_rows = len(df)
missing_percentage = (df.isnull().sum() / total_rows) * 100
print(missing_percentage)
For our example df:
StudentID 0.0
Score 20.0
Grade 20.0
Attendance 20.0
dtype: float64
This shows that 20% of the entries in the 'Score', 'Grade', and 'Attendance' columns are missing.
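A common follow-up is flagging columns whose missing fraction exceeds a threshold you choose, as candidates for dropping or careful imputation. A sketch with a hypothetical 10% cutoff:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'StudentID': [101, 102, 103, 104, 105],
                   'Score': [85, np.nan, 77, 92, 88],
                   'Grade': ['B', 'C', np.nan, 'A', 'B'],
                   'Attendance': [0.95, 0.80, 0.85, np.nan, 0.92]})

missing_percentage = (df.isnull().sum() / len(df)) * 100

# Columns exceeding a (hypothetical) 10% missingness threshold
threshold = 10
high_missing = missing_percentage[missing_percentage > threshold]
print(high_missing.index.tolist())  # ['Score', 'Grade', 'Attendance']
```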
Visualizations can provide a quick overview of the extent of missing data, especially when dealing with many columns. A simple bar chart showing the count or percentage of missing values per column is effective.
Here's how you might create a bar chart of the missing percentages using Plotly Express (a minimal sketch; any plotting library would work equally well):
import plotly.express as px

# Bar chart of the missing percentage per column
# (uses the 'missing_percentage' Series calculated above)
fig = px.bar(x=missing_percentage.index, y=missing_percentage.values,
             labels={'x': 'Column', 'y': 'Percent missing'})
fig.show()
Percentage of missing values for each column in the sample DataFrame. Columns with no missing data show 0%.
For more complex datasets, visualizing the pattern of missingness (e.g., with a Seaborn heatmap: sns.heatmap(df.isnull(), cbar=False)) can reveal whether missing values tend to occur in the same rows across different columns, suggesting potential underlying relationships or systematic issues in data collection.
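Even without plotting, you can get a quick sense of row-level patterns by counting missing values per row with axis=1; rows with several gaps at once often hint at systematic collection problems. A sketch on the same sample df:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'StudentID': [101, 102, 103, 104, 105],
                   'Score': [85, np.nan, 77, 92, 88],
                   'Grade': ['B', 'C', np.nan, 'A', 'B'],
                   'Attendance': [0.95, 0.80, 0.85, np.nan, 0.92]})

# Number of missing values in each row
missing_per_row = df.isnull().sum(axis=1)
print(missing_per_row.tolist())          # [0, 1, 1, 1, 0]

# Rows with more than one missing value (none in this small sample)
print(df[missing_per_row > 1].shape[0])  # 0
```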
Identifying where and how much data is missing is the essential first step. With this knowledge, you are equipped to move on to the next section, which discusses strategies for handling these identified missing values, such as deleting them or filling them in using imputation techniques.
© 2025 ApX Machine Learning