Okay, you've identified the missing values (NaNs) lurking in your dataset using methods from the previous section. Now comes the decision point: what should you do about them? Ignoring missing data is rarely an option, as most analytical algorithms and visualization tools don't handle NaN values gracefully. Broadly, you have two main approaches: removing the missing data (deletion) or filling it in with estimated values (imputation). The choice isn't always obvious and depends heavily on the context, the amount of missing data, and your analytical goals.
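For a concrete sense of why this matters, here is a minimal sketch (the toy array is made up for illustration) showing how NumPy aggregations propagate NaN instead of skipping it:
import numpy as np
# A single NaN poisons the plain aggregation...
values = np.array([1.0, 2.0, np.nan, 4.0])
print(np.mean(values))     # nan
# ...while only the NaN-aware variant returns a usable number
print(np.nanmean(values))  # 2.333...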
The most straightforward approach is simply to remove rows or columns containing missing values.
This involves removing entire rows (observations) that contain any missing value in any column. In Pandas, you can easily drop rows with any NaN values using the .dropna() method:
# Assuming 'df' is your DataFrame
df_cleaned_rows = df.dropna()
# Check the shape before and after
print(f"Original shape: {df.shape}")
print(f"Shape after dropping rows with NaN: {df_cleaned_rows.shape}")
Alternatively, if a particular column (feature) has a very high proportion of missing values, it might provide little useful information for your analysis. In such cases, deleting the entire column might be a reasonable option.
To drop columns with missing values in Pandas, specify axis=1 in the .dropna() method. You can also use the thresh parameter to keep only columns that have at least a certain number of non-NaN values.
# Drop columns where *all* values are NaN
df_cleaned_cols_all_nan = df.dropna(axis=1, how='all')
# Drop columns with more than 30% missing values
threshold = int(len(df) * 0.7)  # Keep columns with at least 70% non-NaN values
df_cleaned_cols_thresh = df.dropna(axis=1, thresh=threshold)
print(f"Original shape: {df.shape}")
print(f"Shape after dropping all-NaN columns: {df_cleaned_cols_all_nan.shape}")
print(f"Shape after dropping columns with >30% NaN: {df_cleaned_cols_thresh.shape}")
Imputation involves replacing missing values with substitute values. This preserves your sample size but introduces artificial data points, potentially affecting the data's original distribution and relationships. Simple imputation techniques are common during initial EDA.
Replace missing values in a numerical column with the mean of the observed values in that column.
# Impute missing values in 'numerical_col' with the mean
mean_value = df['numerical_col'].mean()
df['numerical_col_mean_imputed'] = df['numerical_col'].fillna(mean_value)
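Because every filled-in value sits exactly at the mean, this tends to compress the column's spread. A quick check, reusing the columns above, makes the distortion visible (a sketch, not a required step):
# The imputed column's standard deviation is typically smaller, since the
# added values contribute nothing to the spread but inflate the count
print(f"Std before imputation: {df['numerical_col'].std():.3f}")
print(f"Std after imputation:  {df['numerical_col_mean_imputed'].std():.3f}")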
Replace missing values in a numerical column with the median of the observed values.
# Impute missing values in 'numerical_col' with the median
median_value = df['numerical_col'].median()
df['numerical_col_median_imputed'] = df['numerical_col'].fillna(median_value)
Replace missing values in a column with the mode (the most frequent value) of the observed values.
Pandas' .mode() method returns a Series (because there might be multiple modes). Typically, you'll use the first mode ([0]).
# Impute missing values in 'categorical_col' with the mode
mode_value = df['categorical_col'].mode()[0]
df['categorical_col_mode_imputed'] = df['categorical_col'].fillna(mode_value)
# Can also be applied to numerical columns, though less common
# numerical_mode_value = df['numerical_col'].mode()[0]
# df['numerical_col_mode_imputed'] = df['numerical_col'].fillna(numerical_mode_value)
Replace missing values with a predefined constant, such as 0, -1, "Unknown," or "Missing."
# Impute numerical NaNs with 0
df['numerical_col_zero_imputed'] = df['numerical_col'].fillna(0)
# Impute categorical NaNs with 'Unknown'
df['categorical_col_unknown_imputed'] = df['categorical_col'].fillna('Unknown')
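If several columns need different constants, .fillna() also accepts a dictionary mapping column names to fill values, which keeps this to a single call (a brief sketch using the same column names as above):
# Impute multiple columns at once with per-column constants
df_filled = df.fillna({'numerical_col': 0, 'categorical_col': 'Unknown'})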
So, should you delete or impute? There's no universal answer. Weigh the context of your analysis, the proportion of missing values in each column, and your analytical goals.
Practical Recommendation:
Start by quantifying the extent of missing data per column and overall. If missing data is minimal, deletion might be the simplest path. If it's more prevalent, simple imputation (median for numerical, mode for categorical) is a reasonable starting point for initial EDA, allowing you to proceed with analysis while being mindful of the potential distortions introduced. Always document the strategy you choose, as it's an important step in your data preparation process. More complex imputation methods can be explored later if simple techniques prove inadequate for downstream tasks like modeling.
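The quantification step mentioned above takes only a few lines (a minimal sketch, assuming 'df' is your DataFrame):
# Missing values per column: absolute counts and percentages
missing_counts = df.isna().sum()
missing_pct = df.isna().mean() * 100
print(missing_counts)
print(missing_pct.round(1))
# Overall share of missing cells across the whole DataFrame
overall_pct = df.isna().values.mean() * 100
print(f"Overall missing: {overall_pct:.1f}%")
These per-column figures are what inform the choice between the deletion and imputation strategies described above.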