Let's put the theory into practice. Knowing how to identify and handle missing data is fundamental, and the best way to learn is by doing. We'll use Python with the popular Pandas library, a standard tool in data science, to work through a small example dataset. If you haven't used Pandas before, don't worry. We'll guide you step-by-step.
First, imagine we have collected some data about students, but some information is missing. Let's create this data in a Pandas DataFrame. We use numpy.nan to represent the missing values.
import pandas as pd
import numpy as np
# Sample data with missing values
data = {
'StudentID': [101, 102, 103, 104, 105, 106],
'Score': [85, np.nan, 70, 95, 88, 78],
'Grade': ['B', 'C', np.nan, 'A', 'B', np.nan],
'Attendance': [90, 80, 85, np.nan, 95, 88]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
This code creates a DataFrame named df. Printing it shows us the data, including the NaN entries representing missing information.
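For a quick first look before any explicit counting, df.info() summarizes each column's non-null count alongside its dtype, which already hints at where data is missing:

# Column dtypes and non-null counts at a glance
df.info()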
As discussed earlier, the first step is to find out where and how much data is missing. We can use the isnull() method combined with sum() in Pandas. isnull() returns a DataFrame of boolean values (True if missing, False otherwise), and sum() then counts the True values per column.
# Count missing values in each column
missing_counts = df.isnull().sum()
print("\nMissing Value Counts:")
print(missing_counts)
The output will show:
StudentID 0
Score 1
Grade 2
Attendance 1
dtype: int64
This tells us we have one missing Score, two missing Grade entries, and one missing Attendance percentage. StudentID has no missing values.
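If you only need the overall total rather than per-column counts, chaining a second sum() collapses the Series into a single number:

# Total number of missing cells across the whole DataFrame
total_missing = df.isnull().sum().sum()
print(f"Total missing values: {total_missing}")  # 4 in this example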
Sometimes, a visual inspection helps you understand the extent of missing data, especially in larger datasets. A simple bar chart showing the count of missing values per column can be effective.
Bar chart displaying the number of missing entries found in the 'Score', 'Grade', and 'Attendance' columns.
This visualization confirms our findings from isnull().sum(): 'Grade' has the most missing values in this small dataset.
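If you want to reproduce a chart like this yourself, here is a minimal sketch using matplotlib (assuming it is installed; the exact styling of the original figure may differ):

import matplotlib.pyplot as plt

# Bar chart of missing-value counts, restricted to columns that have any
missing_counts = df.isnull().sum()
missing_counts[missing_counts > 0].plot(kind='bar', title='Missing Values per Column')
plt.ylabel('Count of missing entries')
plt.tight_layout()
plt.show()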
One way to handle missing data is to remove any row containing at least one missing value. This is called listwise deletion. Pandas provides the dropna() method for this.
# Create a copy to avoid modifying the original DataFrame
df_dropped_rows = df.copy()
# Drop rows with any NaN values
df_dropped_rows.dropna(inplace=True) # inplace=True modifies the DataFrame directly
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_rows)
print(f"\nOriginal shape: {df.shape}")
print(f"Shape after dropping rows: {df_dropped_rows.shape}")
You'll notice that rows with index 1, 2, and 3 (corresponding to StudentIDs 102, 103, and 104) have been removed because they each had at least one NaN. The resulting DataFrame df_dropped_rows only contains complete records.
Consideration: This approach is simple but can lead to significant data loss, especially if missing values are widespread. Losing three out of six students (50% of the data) is substantial. This is generally only advisable if the number of rows with missing data is very small compared to the total dataset size.
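A middle ground worth knowing about (a sketch, not used in the example above) is the subset parameter of dropna(), which removes a row only when a value is missing in the columns you name:

# Drop rows only where 'Score' is missing; missing grades or attendance are kept
df_score_complete = df.dropna(subset=['Score'])
print(df_score_complete)  # removes only StudentID 102 in this dataset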
If a column has a very large proportion of missing values, it might provide little useful information and could be dropped entirely. We can use dropna() again, but specify axis=1 to target columns. We can also set the thresh (threshold) parameter, which specifies the minimum number of non-missing values required for a column to be kept.
Let's say we decide to drop any column that has fewer than 5 non-missing values (our DataFrame has 6 rows total).
# Create another copy
df_dropped_cols = df.copy()
# Drop columns with fewer than 5 non-missing values
threshold = 5
df_dropped_cols.dropna(axis=1, thresh=threshold, inplace=True)
print("\nDataFrame after dropping columns with many missing values:")
print(df_dropped_cols)
print(f"\nOriginal shape: {df.shape}")
print(f"Shape after dropping columns: {df_dropped_cols.shape}")
In our example, 'Grade' has only 4 non-missing values (6 total rows - 2 missing = 4). Since 4 is less than our threshold of 5, the 'Grade' column is removed. The 'Score' and 'Attendance' columns each have 5 non-missing values, meeting the threshold, so they are kept.
Consideration: Removing columns means losing that feature entirely. Only do this if the column is deemed non-essential or if the proportion of missing data is so high that imputation would be unreliable.
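To judge that proportion before deciding, you can take the mean of the boolean mask from isnull(); since True counts as 1, this yields the fraction of missing values per column (a quick check):

# Fraction of missing entries per column (True counts as 1 in the mean)
missing_fraction = df.isnull().mean()
print(missing_fraction)  # 'Grade' is about 0.33 here (2 of 6 values missing)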
Instead of deleting data, we can fill the gaps. This is called imputation. Simple methods involve replacing missing values with the mean, median, or mode of the respective column.
Let's impute the missing values in our original df.
# Create a copy for imputation
df_imputed = df.copy()
# 1. Impute 'Score' (numerical) with the median
median_score = df_imputed['Score'].median()
print(f"\nMedian score for imputation: {median_score}")
# Assign the result back: fillna(..., inplace=True) on a single column is a
# chained assignment and does not reliably update df_imputed in recent Pandas
df_imputed['Score'] = df_imputed['Score'].fillna(median_score)
# 2. Impute 'Attendance' (numerical) with the mean
mean_attendance = df_imputed['Attendance'].mean()
print(f"Mean attendance for imputation: {mean_attendance:.2f}") # Format to 2 decimal places
df_imputed['Attendance'] = df_imputed['Attendance'].fillna(mean_attendance)
# 3. Impute 'Grade' (categorical) with the mode
# Note: mode() can return multiple values if they have the same frequency. We take the first one [0].
mode_grade = df_imputed['Grade'].mode()[0]
print(f"Mode grade for imputation: {mode_grade}")
df_imputed['Grade'] = df_imputed['Grade'].fillna(mode_grade)
print("\nDataFrame after imputation:")
print(df_imputed)
# Verify that no missing values remain
print("\nMissing values count after imputation:")
print(df_imputed.isnull().sum())
The output shows the calculated median score, mean attendance, and mode grade. The final imputed DataFrame has all NaN values replaced:
- The missing Score (for StudentID 102) is replaced by the median score (85.0, calculated from [85, 70, 95, 88, 78]).
- The missing Attendance (for StudentID 104) is replaced by the mean attendance (87.6, calculated from [90, 80, 85, 95, 88]).
- The missing Grade values (for StudentIDs 103 and 106) are replaced by the mode grade ('B', as it appears most frequently).
Finally, checking isnull().sum() confirms that there are no missing values left in df_imputed.
Consideration: Simple imputation preserves the dataset size but can distort relationships between variables and reduce variance (especially mean imputation). Using the median for skewed numerical data and the mode for categorical data are common starting points.
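You can observe the variance reduction directly by comparing a column's standard deviation before and after mean imputation (a small check using the DataFrames defined above):

# Mean imputation adds values exactly at the mean, shrinking the spread
print(f"Attendance std before imputation: {df['Attendance'].std():.2f}")
print(f"Attendance std after imputation:  {df_imputed['Attendance'].std():.2f}")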
As you've seen, there's no single "best" way to handle missing data. The choice depends on:
- How much data is missing, and whether it is concentrated in particular rows or columns.
- How important the affected records or features are to your analysis.
- Whether a column is numerical or categorical, which shapes the imputation options.
In this practical exercise, you've applied the fundamental techniques: detecting, visualizing, deleting (rows/columns), and basic imputing (mean/median/mode). These methods form the foundation for preparing data for analysis. As you progress, you'll encounter more advanced imputation techniques, but mastering these basics is an essential first step.