As we saw in the previous section, datasets often come with gaps, represented in Pandas as NaN (Not a Number). One straightforward approach to dealing with these gaps is simply to remove the rows or columns that contain them. This is often a reasonable first step, especially if only a small fraction of your data is missing, or if a particular row or column has so many missing values that it's not informative.
Pandas provides the dropna() method for this purpose. Let's explore how it works.
By default, dropna() removes entire rows if any value in that row is NaN.
Consider this example DataFrame:
import pandas as pd
import numpy as np
data = {'col1': [1, 2, np.nan, 4, 5],
        'col2': [np.nan, 7, 8, 9, 10],
        'col3': [11, 12, 13, 14, np.nan],
        'col4': ['A', 'B', 'C', 'D', 'E']}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
col1 col2 col3 col4
0 1.0 NaN 11.0 A
1 2.0 7.0 12.0 B
2 NaN 8.0 13.0 C
3 4.0 9.0 14.0 D
4 5.0 10.0 NaN E
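Before dropping anything, it can help to see how much of df is actually missing. This is a quick side check, not something dropna() requires; it is just a small sketch using the standard isna() method:
# Count missing values per column and per row of the example df
print(df.isna().sum())         # missing values in each column
print(df.isna().sum(axis=1))   # missing values in each row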
Now, let's use dropna() with its default settings:
df_dropped_rows = df.dropna() # Default is axis=0 (rows) and how='any'
print("\nDataFrame after dropping rows with any NaN:")
print(df_dropped_rows)
Output:
DataFrame after dropping rows with any NaN:
col1 col2 col3 col4
1 2.0 7.0 12.0 B
3 4.0 9.0 14.0 D
Notice that rows 0, 2, and 4 were removed because each contained at least one NaN value. Only rows 1 and 3, which had complete data across all columns, were kept.
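Also note that the surviving rows keep their original index labels (1 and 3). If you would rather have a fresh 0-based index, one option is to chain reset_index; a small sketch:
# Drop rows with any NaN, then renumber the index from 0
df_reset = df.dropna().reset_index(drop=True)
print(df_reset)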
The dropna() method has parameters to give you more control:
The how parameter:
- how='any' (default): drop the row if any NaN values are present.
- how='all': drop the row only if all values in that row are NaN.
Let's create a DataFrame where one row is entirely NaN:
data_with_all_nan = {'col1': [1, np.nan, np.nan, 4],
                     'col2': [np.nan, 7, np.nan, 9],
                     'col3': [11, 12, np.nan, 14]}
df_all_nan = pd.DataFrame(data_with_all_nan)
print("\nOriginal DataFrame with an all-NaN row possibility:")
print(df_all_nan)
df_dropped_all = df_all_nan.dropna(how='all')
print("\nDataFrame after dropping rows with all NaN:")
print(df_dropped_all)
Output:
Original DataFrame with an all-NaN row possibility:
col1 col2 col3
0 1.0 NaN 11.0
1 NaN 7.0 12.0
2 NaN NaN NaN
3 4.0 9.0 14.0
DataFrame after dropping rows with all NaN:
col1 col2 col3
0 1.0 NaN 11.0
1 NaN 7.0 12.0
3 4.0 9.0 14.0
In this case, only row 2, where all values were NaN, was dropped when using how='all'.
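For contrast, applying the default how='any' to the same df_all_nan would remove every row that contains any NaN, leaving only the fully populated row 3:
# Default behavior on the same data: only row 3 (no NaN at all) survives
print(df_all_nan.dropna())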
The thresh parameter: this lets you specify a minimum number of non-missing values required for a row to be kept. For example, thresh=3 means a row will be kept only if it has at least 3 valid (non-NaN) values.
Using our original df:
# Keep rows with at least 3 non-NaN values
df_thresh3 = df.dropna(thresh=3)
print("\nDataFrame keeping rows with at least 3 non-NaN values:")
print(df_thresh3)
Output:
DataFrame keeping rows with at least 3 non-NaN values:
col1 col2 col3 col4
0 1.0 NaN 11.0 A # Kept (3 non-NaN)
1 2.0 7.0 12.0 B # Kept (4 non-NaN)
2 NaN 8.0 13.0 C # Kept (3 non-NaN)
3 4.0 9.0 14.0 D # Kept (4 non-NaN)
4 5.0 10.0 NaN E # Kept (3 non-NaN)
Here, all rows were kept because each had at least 3 non-missing values. If we increased the threshold:
# Keep rows with at least 4 non-NaN values
df_thresh4 = df.dropna(thresh=4)
print("\nDataFrame keeping rows with at least 4 non-NaN values:")
print(df_thresh4)
Output:
DataFrame keeping rows with at least 4 non-NaN values:
col1 col2 col3 col4
1 2.0 7.0 12.0 B
3 4.0 9.0 14.0 D
Now, only rows 1 and 3 are kept, as they are the only ones with 4 valid values.
Sometimes, you might want to remove entire columns if they contain missing data, especially if a column has many NaNs or is not essential for your analysis. You can do this by setting the axis parameter to 1 (or 'columns').
# Drop columns containing any NaN values
df_dropped_cols = df.dropna(axis=1) # axis=1 targets columns
print("\nDataFrame after dropping columns with any NaN:")
print(df_dropped_cols)
Output:
DataFrame after dropping columns with any NaN:
col4
0 A
1 B
2 C
3 D
4 E
In our example df, columns col1, col2, and col3 all contained at least one NaN, so they were dropped. Only col4, which had no missing values, remained.
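As mentioned above, axis='columns' is an equivalent way to spell axis=1, which some find more readable; a quick sketch producing the same result:
# Same as axis=1, using the string alias
df_dropped_cols_alt = df.dropna(axis='columns')
print(df_dropped_cols_alt)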
The how and thresh parameters work similarly when applied to columns:
- df.dropna(axis=1, how='all') would drop columns only if all their values are NaN.
- df.dropna(axis=1, thresh=4) would keep columns only if they have at least 4 non-NaN values.
# Keep columns with at least 4 non-NaN values
df_thresh4_cols = df.dropna(axis=1, thresh=4)
print("\nDataFrame keeping columns with at least 4 non-NaN values:")
print(df_thresh4_cols)
Output:
DataFrame keeping columns with at least 4 non-NaN values:
col1 col2 col3 col4
0 1.0 NaN 11.0 A
1 2.0 7.0 12.0 B
2 NaN 8.0 13.0 C
3 4.0 9.0 14.0 D
4 5.0 10.0 NaN E
In this case, col1, col2, and col3 each have 4 non-NaN values (out of 5 total rows), and col4 has 5. Since all meet the threshold of 4, no columns are dropped.
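The how='all' variant from the list above is not shown in the outputs, so here is a quick sketch; since no column of df is entirely NaN, the result is simply df unchanged:
# Drop a column only if every one of its values is NaN; none qualify here
print(df.dropna(axis=1, how='all'))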
By default, dropna() returns a new DataFrame with the missing values dropped, leaving the original DataFrame unchanged. If you want to modify the original DataFrame directly, you can use the inplace=True parameter.
df_copy = df.copy() # Make a copy to modify
print("\nDataFrame before inplace drop:")
print(df_copy)
df_copy.dropna(inplace=True) # Modifies df_copy directly
print("\nDataFrame after inplace drop:")
print(df_copy)
Output:
DataFrame before inplace drop:
col1 col2 col3 col4
0 1.0 NaN 11.0 A
1 2.0 7.0 12.0 B
2 NaN 8.0 13.0 C
3 4.0 9.0 14.0 D
4 5.0 10.0 NaN E
DataFrame after inplace drop:
col1 col2 col3 col4
1 2.0 7.0 12.0 B
3 4.0 9.0 14.0 D
Use inplace=True with caution. Since it modifies your data directly, it's often safer to assign the result to a new variable unless you are certain you no longer need the original data with the NaN values.
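A related pitfall worth knowing: when inplace=True is used, dropna() returns None, so assigning its result back to a variable discards your data. A small sketch (df_copy2 is just a hypothetical name for another copy):
# dropna(inplace=True) returns None; don't assign its result
df_copy2 = df.copy()
result = df_copy2.dropna(inplace=True)
print(result)     # None
print(df_copy2)   # the DataFrame with NaN rows removed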
Dropping missing data is simple, but it comes at a cost: you lose information.
This strategy is generally most suitable when only a small fraction of your data is missing, or when a particular row or column is missing so many values that it carries little useful information.
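One simple way to quantify that loss, using the example df from above, is to compare the number of rows before and after dropping:
# Gauge how many rows would be lost before committing to a drop
rows_before = len(df)
rows_after = len(df.dropna())
print(f"Dropping rows with any NaN removes {rows_before - rows_after} of {rows_before} rows")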
Always consider the potential impact of removing data before doing so. If dropping seems too drastic, the next section explores an alternative: filling in the missing values.