After identifying and potentially filling some missing data, another common step in data preparation is removing data that is unnecessary or problematic. You might need to remove entire rows (observations) or entire columns (features) from your DataFrame.
Pandas provides the versatile drop() method to handle both row and column removal.
To remove one or more rows, you provide the index label(s) of the rows you want to eliminate to the drop() method. You also typically specify that you are targeting rows using the index argument.
Let's start with a sample DataFrame:
import pandas as pd
import numpy as np
data = {'StudentID': ['S101', 'S102', 'S103', 'S104', 'S105', 'S106'],
        'Math': [85, 92, 78, 88, 95, np.nan],
        'Science': [90, 88, 94, 85, 79, np.nan],
        'History': [76, 81, 85, 70, 88, 60],
        'Notes': ['Good', 'Excellent', np.nan, 'Needs Improvement', 'Good', 'Incomplete']}
df = pd.DataFrame(data)
df.set_index('StudentID', inplace=True) # Set StudentID as the index
print("Original DataFrame:")
print(df)
This produces:
Original DataFrame:
           Math  Science  History              Notes
StudentID
S101       85.0     90.0       76               Good
S102       92.0     88.0       81          Excellent
S103       78.0     94.0       85                NaN
S104       88.0     85.0       70  Needs Improvement
S105       95.0     79.0       88               Good
S106        NaN      NaN       60         Incomplete
Suppose we want to remove the student S106 because their data is incomplete. We can do this using drop():
# Drop a single row by index label
df_dropped_row = df.drop(index='S106')
print("\nDataFrame after dropping row 'S106':")
print(df_dropped_row)
Output:
DataFrame after dropping row 'S106':
           Math  Science  History              Notes
StudentID
S101       85.0     90.0       76               Good
S102       92.0     88.0       81          Excellent
S103       78.0     94.0       85                NaN
S104       88.0     85.0       70  Needs Improvement
S105       95.0     79.0       88               Good
Notice that df.drop() by default returns a new DataFrame with the specified row(s) removed. The original DataFrame df remains unchanged.
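A quick way to confirm this behavior is to check both objects after the call. The snippet below is a minimal sketch that rebuilds a cut-down version of the sample DataFrame (three students, two columns) so that it runs on its own; the smaller data is an assumption made for brevity, not part of the original example.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the sample DataFrame (illustrative data only)
df = pd.DataFrame(
    {"Math": [85, 92, np.nan], "History": [76, 81, 60]},
    index=pd.Index(["S101", "S102", "S106"], name="StudentID"),
)

df_dropped_row = df.drop(index="S106")

# The returned DataFrame no longer contains S106 ...
print("S106" in df_dropped_row.index)  # False
# ... but the original still does
print("S106" in df.index)              # True
print(df.shape, df_dropped_row.shape)  # (3, 2) (2, 2)
```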
To drop multiple rows, pass a list of index labels:
# Drop multiple rows by index labels
df_dropped_rows = df.drop(index=['S103', 'S106'])
print("\nDataFrame after dropping rows 'S103' and 'S106':")
print(df_dropped_rows)
Output:
DataFrame after dropping rows 'S103' and 'S106':
           Math  Science  History              Notes
StudentID
S101       85.0     90.0       76               Good
S102       92.0     88.0       81          Excellent
S104       88.0     85.0       70  Needs Improvement
S105       95.0     79.0       88               Good
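The labels passed to drop() do not have to be typed by hand; they can be computed from a condition. As a hedged sketch, the example below selects the rows where Math is missing and passes their index labels to drop(). The small DataFrame is an illustrative stand-in for the sample data. For this particular case pandas also offers dropna(), so treat this as one possible approach rather than the recommended one.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the sample DataFrame (illustrative data only)
df = pd.DataFrame(
    {"Math": [85, 92, np.nan], "History": [76, 81, 60]},
    index=pd.Index(["S101", "S102", "S106"], name="StudentID"),
)

# Boolean mask: True for rows where the Math score is missing
missing_math = df["Math"].isna()

# Index labels of the rows that match the condition
labels_to_drop = df[missing_math].index

df_complete = df.drop(index=labels_to_drop)
print(list(df_complete.index))  # ['S101', 'S102']
```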
Removing columns is very similar, but instead of providing index labels, you provide column names. The preferred way to specify that you're targeting columns is by using the columns argument.
Let's say the Notes column isn't needed for our analysis. We can remove it:
# Drop a single column by name
df_dropped_col = df.drop(columns='Notes')
print("\nDataFrame after dropping the 'Notes' column:")
print(df_dropped_col)
Output:
DataFrame after dropping the 'Notes' column:
           Math  Science  History
StudentID
S101       85.0     90.0       76
S102       92.0     88.0       81
S103       78.0     94.0       85
S104       88.0     85.0       70
S105       95.0     79.0       88
S106        NaN      NaN       60
Again, this operation returns a new DataFrame. The original df still has the Notes column.
To drop multiple columns, pass a list of column names:
# Drop multiple columns by name
df_dropped_cols = df.drop(columns=['Math', 'Notes'])
print("\nDataFrame after dropping 'Math' and 'Notes' columns:")
print(df_dropped_cols)
Output:
DataFrame after dropping 'Math' and 'Notes' columns:
           Science  History
StudentID
S101          90.0       76
S102          88.0       81
S103          94.0       85
S104          85.0       70
S105          79.0       88
S106           NaN       60
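The index and columns arguments can also be combined in a single call, which removes rows and columns in one step. The sketch below assumes a cut-down version of the sample DataFrame so that it is self-contained; the variable name df_trimmed is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the sample DataFrame (illustrative data only)
df = pd.DataFrame(
    {
        "Math": [85, 92, np.nan],
        "History": [76, 81, 60],
        "Notes": ["Good", "Excellent", "Incomplete"],
    },
    index=pd.Index(["S101", "S102", "S106"], name="StudentID"),
)

# Remove one row and one column in the same call
df_trimmed = df.drop(index="S106", columns="Notes")

print(list(df_trimmed.index))    # ['S101', 'S102']
print(list(df_trimmed.columns))  # ['Math', 'History']
```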
In older code, you might see axis=1 used to indicate column dropping (e.g., df.drop('Notes', axis=1)). While this works, using the columns argument (e.g., df.drop(columns='Notes')) is generally considered more readable and explicit. Similarly, axis=0 corresponds to dropping rows (the default), but using the index argument is clearer.
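To see that the two spellings produce the same result, the following sketch compares them with DataFrame.equals(). It uses a small stand-in DataFrame rather than the full sample data, which is an assumption made to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the sample DataFrame (illustrative data only)
df = pd.DataFrame(
    {"Math": [85, 92, np.nan], "Notes": ["Good", "Excellent", "Incomplete"]},
    index=pd.Index(["S101", "S102", "S106"], name="StudentID"),
)

# Column removal: axis-based form vs. keyword form
cols_by_axis = df.drop("Notes", axis=1)
cols_by_keyword = df.drop(columns="Notes")
print(cols_by_axis.equals(cols_by_keyword))  # True

# Row removal: axis-based form vs. keyword form
rows_by_axis = df.drop("S106", axis=0)
rows_by_keyword = df.drop(index="S106")
print(rows_by_axis.equals(rows_by_keyword))  # True
```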
If you are certain you want to modify the original DataFrame directly without creating a new one, you can use the inplace=True argument.
# Create a copy to modify inplace
df_copy = df.copy()
print("\nOriginal df_copy (first 3 rows):")
print(df_copy.head(3))
# Drop the 'History' column inplace
return_value = df_copy.drop(columns='History', inplace=True)
print("\ndf_copy after dropping 'History' inplace (first 3 rows):")
print(df_copy.head(3))
print(f"\nReturn value when inplace=True: {return_value}")
Output:
Original df_copy (first 3 rows):
           Math  Science  History      Notes
StudentID
S101       85.0     90.0       76       Good
S102       92.0     88.0       81  Excellent
S103       78.0     94.0       85        NaN
df_copy after dropping 'History' inplace (first 3 rows):
           Math  Science      Notes
StudentID
S101       85.0     90.0       Good
S102       92.0     88.0  Excellent
S103       78.0     94.0        NaN
Return value when inplace=True: None
Note that when inplace=True is used, the drop method modifies the DataFrame directly and returns None. Be cautious with inplace=True: the change cannot be undone unless you kept a copy of the original data. It's often safer, especially when learning or exploring, to rely on the default behavior, which returns a modified copy and leaves the original data available.
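A common alternative to inplace=True is to reassign the result of drop() to the same variable name. The sketch below illustrates this under the assumption of a small stand-in DataFrame; the extra name original exists only to show that the earlier object is left untouched.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the sample DataFrame (illustrative data only)
df = pd.DataFrame(
    {"Math": [85, 92, np.nan], "History": [76, 81, 60]},
    index=pd.Index(["S101", "S102", "S106"], name="StudentID"),
)

# Keep a second reference to the original object for comparison
original = df

# Reassign instead of using inplace=True; drop() returns a new DataFrame
df = df.drop(columns="History")

print(list(df.columns))        # ['Math']
print(list(original.columns))  # ['Math', 'History']
```

Because the name df now points to the new DataFrame, later code sees the reduced data, while any other reference to the original object still holds every column.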
Dropping rows and columns using drop() is a fundamental step in refining your dataset, allowing you to focus on the data most relevant to your analysis.
© 2025 ApX Machine Learning