While dropping rows or columns with missing data using dropna() is straightforward, it's often not the best approach. Removing data means losing potentially valuable information, especially if missing values are sparse or if the dataset is small. A common alternative is to fill the missing values, also known as imputation.
The primary tool for this in Pandas is the fillna() method. It provides flexible options for replacing NaN (Not a Number) values within a Series or DataFrame.
The fillna() Method
Let's start with a simple DataFrame containing missing values:
import pandas as pd
import numpy as np
data = {'col_a': [1, np.nan, 3, 4, np.nan],
        'col_b': [np.nan, 6, 7, 8, 9],
        'col_c': ['apple', 'banana', np.nan, 'orange', 'banana']}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\nMissing values per column:")
print(df.isnull().sum())
Running this code will show our DataFrame and the count of missing values in each column:
Original DataFrame:
   col_a  col_b   col_c
0    1.0    NaN   apple
1    NaN    6.0  banana
2    3.0    7.0     NaN
3    4.0    8.0  orange
4    NaN    9.0  banana
Missing values per column:
col_a    2
col_b    1
col_c    1
dtype: int64
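To make the cost of dropping concrete, the following sketch rebuilds the same sample DataFrame and compares how many rows survive dropna() with the share of missing values in each column. The names rows_before, rows_after, and missing_share are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, np.nan, 3, 4, np.nan],
    'col_b': [np.nan, 6, 7, 8, 9],
    'col_c': ['apple', 'banana', np.nan, 'orange', 'banana'],
})

rows_before = len(df)
rows_after = len(df.dropna())      # keeps only rows with no missing value at all
missing_share = df.isna().mean()   # fraction of missing values per column

print(f"Rows before dropna(): {rows_before}")
print(f"Rows after dropna():  {rows_after}")
print("Share of missing values per column:")
print(missing_share)
```

Only one of the five rows is complete, even though no column is missing more than 40% of its values, so dropping rows would discard far more data than filling.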
The most basic strategy is to replace all NaN values with a fixed value. The choice of value depends on the context. For numerical columns, 0 is a common choice, while for categorical columns, you might use a placeholder like "Unknown" or "Missing".
To fill all NaNs in the entire DataFrame with 0:
df_filled_zero = df.fillna(0)
print("\nDataFrame after filling NaN with 0:")
print(df_filled_zero)
Output:
DataFrame after filling NaN with 0:
   col_a  col_b   col_c
0    1.0    0.0   apple
1    0.0    6.0  banana
2    3.0    7.0       0
3    4.0    8.0  orange
4    0.0    9.0  banana
Notice that fillna(0) replaced the NaN in col_c (a string column) with the integer 0. This might not be ideal. Often, you'll want to apply different filling strategies to different columns. We'll see how to do that later.
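One possible way to keep a numeric placeholder out of text columns is to select the numeric columns first and fill only those. This is a minimal sketch that assumes every numeric column should receive 0; the names num_cols and df_num_filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, np.nan, 3, 4, np.nan],
    'col_b': [np.nan, 6, 7, 8, 9],
    'col_c': ['apple', 'banana', np.nan, 'orange', 'banana'],
})

# Pick out the numeric columns only (col_a and col_b here)
num_cols = df.select_dtypes(include='number').columns

# Work on a copy so the original DataFrame stays unchanged
df_num_filled = df.copy()
df_num_filled[num_cols] = df_num_filled[num_cols].fillna(0)

print(df_num_filled)
```

With this approach col_c keeps its NaN, which can then be filled separately with a text placeholder.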
Instead of a constant, you can fill missing numerical values with a statistic calculated from the available data in that column, such as the mean or median. For categorical data, the mode (most frequent value) is often used.
Let's fill col_a with its mean and col_b with its median. First, we calculate these values (Pandas skips NaNs in these calculations by default):
mean_a = df['col_a'].mean()
median_b = df['col_b'].median()
print(f"\nMean of col_a: {mean_a}")
print(f"Median of col_b: {median_b}")
# Fill col_a using its mean
df['col_a_filled_mean'] = df['col_a'].fillna(mean_a)
# Fill col_b using its median
df['col_b_filled_median'] = df['col_b'].fillna(median_b)
print("\nDataFrame with mean/median filled columns:")
print(df[['col_a', 'col_a_filled_mean', 'col_b', 'col_b_filled_median']])
Output:
Mean of col_a: 2.6666666666666665
Median of col_b: 7.5
DataFrame with mean/median filled columns:
   col_a  col_a_filled_mean  col_b  col_b_filled_median
0    1.0           1.000000    NaN                  7.5
1    NaN           2.666667    6.0                  6.0
2    3.0           3.000000    7.0                  7.0
3    4.0           4.000000    8.0                  8.0
4    NaN           2.666667    9.0                  9.0
To fill col_c (categorical) with its mode:
mode_c = df['col_c'].mode()[0] # mode() returns a Series, get the first element
print(f"\nMode of col_c: {mode_c}")
df['col_c_filled_mode'] = df['col_c'].fillna(mode_c)
print("\nDataFrame with mode filled column:")
print(df[['col_c', 'col_c_filled_mode']])
Output:
Mode of col_c: banana
DataFrame with mode filled column:
    col_c  col_c_filled_mode
0   apple              apple
1  banana             banana
2     NaN             banana
3  orange             orange
4  banana             banana
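Taking the first element of mode() works cleanly above because banana is the single most frequent value. When several values tie, mode() returns all of them in sorted order, so the first element is simply the first one alphabetically. The short sketch below uses a made-up Series (not from the original example) to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which every non-missing value appears exactly once
s = pd.Series(['orange', 'apple', np.nan, 'banana'])

modes = s.mode()             # all tied values, returned in sorted order
fill_value = modes.iloc[0]   # first of the tied values

print(modes.tolist())
print(s.fillna(fill_value).tolist())
```

If ties are likely in your data, it may be worth checking len(modes) before relying on the first element.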
Sometimes, especially with time-series data or ordered data, it makes sense to fill a missing value based on the value that came immediately before or after it.
Forward fill (ffill): Propagates the last valid observation forward to fill the gap.
Backward fill (bfill): Uses the next valid observation to fill the gap.
Pandas provides these as the ffill() and bfill() methods on Series and DataFrame. Older code often passes method='ffill' or method='bfill' to fillna(), but that argument has been deprecated since pandas 2.1, so the dedicated methods are the safer choice:
# Create a DataFrame with more NaNs to see propagation
data_seq = {'value': [10, np.nan, np.nan, 13, np.nan, 15, np.nan, np.nan, np.nan, 20]}
df_seq = pd.DataFrame(data_seq)
print("\nSequential DataFrame:")
print(df_seq)
# Apply forward fill
df_seq['ffill'] = df_seq['value'].ffill()
# Apply backward fill
df_seq['bfill'] = df_seq['value'].bfill()
print("\nDataFrame after ffill and bfill:")
print(df_seq)
Output:
Sequential DataFrame:
   value
0   10.0
1    NaN
2    NaN
3   13.0
4    NaN
5   15.0
6    NaN
7    NaN
8    NaN
9   20.0
DataFrame after ffill and bfill:
   value  ffill  bfill
0   10.0   10.0   10.0
1    NaN   10.0   13.0
2    NaN   10.0   13.0
3   13.0   13.0   13.0
4    NaN   13.0   15.0
5   15.0   15.0   15.0
6    NaN   15.0   20.0
7    NaN   15.0   20.0
8    NaN   15.0   20.0
9   20.0   20.0   20.0
Observe how ffill carries the last seen value (10, 13, 15) forward, while bfill fills gaps with the next available value (13, 15, 20).
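Two edge cases are worth knowing about. A forward fill cannot fill a gap at the very start of the data, because there is no earlier value to carry forward, and long gaps can be capped with the limit argument. The sketch below uses a small made-up Series to illustrate both; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a three-value gap in the middle
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                  # the leading NaN stays: nothing precedes it
forward_limited = s.ffill(limit=1)   # fill at most one consecutive NaN per gap
both = s.ffill().bfill()             # backward fill picks up the leading NaN

print(pd.DataFrame({
    'value': s,
    'ffill': forward,
    'ffill_limit_1': forward_limited,
    'ffill_then_bfill': both,
}))
```

Chaining ffill() and bfill() is one common way to make sure no gaps remain at either end, though whether that is appropriate depends on the data.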
You often need to apply distinct filling strategies to different columns. For constants or precomputed statistics, you can pass a dictionary to fillna(), where keys are column names and values are the corresponding fill values. A dictionary cannot hold a method such as ffill or bfill, so apply those column by column instead.
The example below fills each column with its own value using a dictionary:
# Reset original DataFrame
df = pd.DataFrame(data)
fill_values = {'col_a': df['col_a'].mean(),
               'col_b': 0,           # Fill col_b NaN with 0
               'col_c': 'Unknown'}   # Fill col_c NaN with 'Unknown'
df_filled_dict = df.fillna(value=fill_values)
print("\nDataFrame filled using a dictionary:")
print(df_filled_dict)
Output:
DataFrame filled using a dictionary:
      col_a  col_b    col_c
0  1.000000    0.0    apple
1  2.666667    6.0   banana
2  3.000000    7.0  Unknown
3  4.000000    8.0   orange
4  2.666667    9.0   banana
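When every numeric column should get the same kind of statistic, the dictionary does not have to be written by hand. As a sketch of one alternative, fillna() also accepts a Series whose index holds column names, which is what df.mean(numeric_only=True) returns. The names col_means and df_filled_means are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, np.nan, 3, 4, np.nan],
    'col_b': [np.nan, 6, 7, 8, 9],
    'col_c': ['apple', 'banana', np.nan, 'orange', 'banana'],
})

# Series indexed by column name: one mean per numeric column
col_means = df.mean(numeric_only=True)

# Columns missing from col_means (here col_c) are left untouched
df_filled_means = df.fillna(col_means)

print(col_means)
print(df_filled_means)
```

Here col_c still contains its NaN, so a second fillna() call with a placeholder or the mode would be needed to complete the job.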
Like many Pandas methods, fillna() returns a new DataFrame with the changes by default, leaving the original DataFrame untouched. If you want to modify the original DataFrame directly, you can use the inplace=True argument:
# Reset original DataFrame again
df = pd.DataFrame(data)
print("\nOriginal DataFrame (before inplace fill):")
print(df)
df.fillna(0, inplace=True) # Modifies df directly
print("\nOriginal DataFrame (after inplace fill):")
print(df)
Output:
Original DataFrame (before inplace fill):
   col_a  col_b   col_c
0    1.0    NaN   apple
1    NaN    6.0  banana
2    3.0    7.0     NaN
3    4.0    8.0  orange
4    NaN    9.0  banana
Original DataFrame (after inplace fill):
   col_a  col_b   col_c
0    1.0    0.0   apple
1    0.0    6.0  banana
2    3.0    7.0       0
3    4.0    8.0  orange
4    0.0    9.0  banana
While inplace=True can seem convenient, it's often recommended, especially when learning, to assign the result to a new variable or reassign it back to the original variable name (df = df.fillna(...)). This makes the data transformation steps more explicit and can prevent unexpected side effects in complex code.
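The assignment style can be sketched with a small numeric DataFrame (made up for this illustration). The original keeps its missing values, the filled copy does not, and because fillna() returns a DataFrame, further steps can be chained onto it.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-only DataFrame for illustration
df = pd.DataFrame({'col_a': [1, np.nan, 3], 'col_b': [np.nan, 6, 7]})

df_filled = df.fillna(0)                 # new DataFrame; df is not modified
cleaned = df.fillna(0).astype('int64')   # chaining works on the returned object

print("Missing in original:", int(df.isna().sum().sum()))
print("Missing in filled copy:", int(df_filled.isna().sum().sum()))
print(cleaned.dtypes)
```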
Choosing the right filling strategy requires understanding your data and the goal of your analysis. Is the missingness random? Would a mean, median, or mode introduce bias? Is the order important (suggesting ffill/bfill)? Carefully considering these questions leads to more robust data preparation.
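As a rough illustration of the bias question, the sketch below uses a made-up Series containing one large outlier. It compares the mean and median as fill values and shows that filling with the mean leaves the mean unchanged while shrinking the standard deviation. The data and variable names are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four observed values (one outlier) and two gaps
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 100.0])

mean_value = s.mean()       # 27.0, pulled upward by the outlier
median_value = s.median()   # 3.5, closer to the typical observation

mean_filled = s.fillna(mean_value)
median_filled = s.fillna(median_value)

print(f"Mean: {mean_value}, median: {median_value}")
print(f"Std before filling:    {s.std():.2f}")
print(f"Std after mean fill:   {mean_filled.std():.2f}")
print(f"Std after median fill: {median_filled.std():.2f}")
```

The filled values sit exactly at the mean, so they add observations without adding spread, which is one reason imputed data can look less variable than the underlying process.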
© 2025 ApX Machine Learning