Duplicate records, or rows that contain exactly the same information across all or a subset of columns, can significantly distort analytical results and model performance. They can inflate counts, skew statistical summaries like means or frequencies, and lead to incorrect conclusions. Identifying and appropriately handling these duplicates is an important step after loading and performing initial inspections of your data.
Pandas provides a straightforward method, duplicated(), to identify duplicate rows within a DataFrame. By default, this method checks whether an entire row is an exact copy of a row that appeared earlier in the DataFrame. It returns a boolean Series where True indicates that the row is a duplicate of a preceding row.
Let's consider a simple DataFrame:
import pandas as pd
data = {'col_a': ['apple', 'banana', 'orange', 'apple', 'banana', 'grape'],
        'col_b': [1, 2, 3, 1, 2, 4],
        'col_c': [True, False, True, True, False, False]}
df_duplicates = pd.DataFrame(data)
print("Original DataFrame:")
print(df_duplicates)
# Identify duplicate rows (based on all columns)
duplicate_mask = df_duplicates.duplicated()
print("\nBoolean mask for duplicates:")
print(duplicate_mask)
This code will output:
Original DataFrame:
col_a col_b col_c
0 apple 1 True
1 banana 2 False
2 orange 3 True
3 apple 1 True
4 banana 2 False
5 grape 4 False
Boolean mask for duplicates:
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
As you can see, rows 3 and 4 are marked as True because they are exact duplicates of rows 0 and 1, respectively.
To get a count of duplicate rows, you can simply sum the boolean Series:
num_duplicates = df_duplicates.duplicated().sum()
print(f"\nNumber of duplicate rows found: {num_duplicates}")
# Output: Number of duplicate rows found: 2
To view the actual duplicate rows, you can use boolean indexing:
print("\nDuplicate rows:")
print(df_duplicates[df_duplicates.duplicated()])
Duplicate rows:
col_a col_b col_c
3 apple 1 True
4 banana 2 False
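The duplicated() method also accepts a keep argument. Setting keep=False flags every member of a duplicate group, including the first occurrence, which is handy when you want to inspect all copies side by side before deciding how to handle them. A minimal sketch on the same DataFrame:
# Flag every row involved in duplication, including first occurrences
all_copies_mask = df_duplicates.duplicated(keep=False)
print(df_duplicates[all_copies_mask])
print(f"Rows involved in duplication: {all_copies_mask.sum()}")
Here rows 0, 1, 3, and 4 are all flagged, so the count is 4 rather than the 2 reported earlier.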
Sometimes, a duplicate is defined not by the entire row matching, but by a specific subset of columns having the same values. For instance, you might consider two records duplicates if they share the same user ID and timestamp, even if other columns differ slightly (perhaps due to data entry variations). You can specify the columns to consider using the subset argument:
# Check duplicates based only on 'col_a' and 'col_b'
subset_duplicates = df_duplicates.duplicated(subset=['col_a', 'col_b'])
print("\nDuplicates based on 'col_a' and 'col_b':")
print(subset_duplicates)
print("\nDataFrame rows identified as duplicates based on subset:")
print(df_duplicates[subset_duplicates])
Duplicates based on 'col_a' and 'col_b':
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
DataFrame rows identified as duplicates based on subset:
col_a col_b col_c
3 apple 1 True
4 banana 2 False
In this example, the result is the same as checking all columns because the duplicates matched on all columns anyway. However, if row 3 had col_c as False, it would still be marked as a duplicate when using subset=['col_a', 'col_b'].
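To make this concrete, here is a small sketch using a hypothetical copy of the DataFrame in which row 3's col_c is flipped to False:
# Hypothetical variant: row 3 now differs in 'col_c' but still matches row 0 on 'col_a' and 'col_b'
df_variant = df_duplicates.copy()
df_variant.loc[3, 'col_c'] = False

print(df_variant.duplicated())                           # row 3 comes back False (no longer a full-row duplicate)
print(df_variant.duplicated(subset=['col_a', 'col_b']))  # row 3 is still True when only the subset is compared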
Once identified, the most common way to handle duplicate rows is to remove them. Pandas provides the drop_duplicates() method for this purpose. Similar to duplicated(), it operates on the entire row by default or can be restricted to a subset of columns.
A significant parameter for drop_duplicates() is keep, which determines which duplicate row to retain:
- keep='first' (default): keep the first occurrence of the duplicate row and remove subsequent ones.
- keep='last': keep the last occurrence of the duplicate row and remove preceding ones.
- keep=False: remove all rows that are part of any duplication. If a row appears twice, both instances are removed.
Let's see how this works:
# Drop duplicates, keeping the first occurrence (default)
df_no_duplicates_first = df_duplicates.drop_duplicates()
print("\nDataFrame after dropping duplicates (keeping first):")
print(df_no_duplicates_first)
# Drop duplicates, keeping the last occurrence
df_no_duplicates_last = df_duplicates.drop_duplicates(keep='last')
print("\nDataFrame after dropping duplicates (keeping last):")
print(df_no_duplicates_last)
# Drop duplicates, removing all instances of duplicates
df_no_duplicates_all = df_duplicates.drop_duplicates(keep=False)
print("\nDataFrame after dropping duplicates (keeping none):")
print(df_no_duplicates_all)
# Drop duplicates based on subset, keeping first
df_no_duplicates_subset = df_duplicates.drop_duplicates(subset=['col_a', 'col_b'], keep='first')
print("\nDataFrame after dropping duplicates based on subset (keeping first):")
print(df_no_duplicates_subset)
The output demonstrates the different behaviors:
DataFrame after dropping duplicates (keeping first):
col_a col_b col_c
0 apple 1 True
1 banana 2 False
2 orange 3 True
5 grape 4 False
DataFrame after dropping duplicates (keeping last):
col_a col_b col_c
2 orange 3 True
3 apple 1 True
4 banana 2 False
5 grape 4 False
DataFrame after dropping duplicates (keeping none):
col_a col_b col_c
2 orange 3 True
5 grape 4 False
DataFrame after dropping duplicates based on subset (keeping first):
col_a col_b col_c
0 apple 1 True
1 banana 2 False
2 orange 3 True
5 grape 4 False
Notice how keep='first' retains rows 0, 1, 2, 5; keep='last' retains 2, 3, 4, 5; and keep=False retains only rows 2 and 5, as they were the only unique rows in the original DataFrame. Dropping based on the subset ['col_a', 'col_b'] with keep='first' yields the same result as the default drop_duplicates() in this specific case.
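Note also that the surviving rows keep their original index labels (0, 1, 2, and 5 in the first result above). If you prefer a clean, consecutive index after removing duplicates, a common follow-up is to chain reset_index(drop=True), as in this small sketch:
# Drop duplicates, then rebuild a consecutive RangeIndex
df_clean = df_duplicates.drop_duplicates().reset_index(drop=True)
print(df_clean)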
By default, drop_duplicates() returns a new DataFrame with the duplicates removed; the original DataFrame remains unchanged. If you want to modify the DataFrame directly, you can use the inplace=True argument, although it's often safer and clearer to assign the result to a new variable or overwrite the existing one (df = df.drop_duplicates()).
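For reference, the two patterns look like this:
# Preferred: assign the result back (or to a new variable)
df_deduped = df_duplicates.drop_duplicates()

# Alternative: modify the existing DataFrame in place (the call returns None)
# df_duplicates.drop_duplicates(inplace=True)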
Handling duplicates is a standard procedure in data cleaning. Deciding whether to check all columns or a subset, and determining which duplicate to keep (if any), depends heavily on the context of your data and the goals of your analysis. Always consider the potential reasons for duplication (e.g., data entry errors, system logging issues, intentional repeats) before deciding on a removal strategy.
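For example, when repeated records represent updates over time, one common strategy is to sort by a timestamp and keep only the most recent row per key. The sketch below uses a hypothetical event log; user_id, timestamp, and status are illustrative column names, not part of the example DataFrame above:
import pandas as pd

# Hypothetical event log in which each user can appear multiple times
events = pd.DataFrame({
    'user_id': [101, 102, 101, 103, 102],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05',
                                 '2024-01-03', '2024-01-06']),
    'status': ['active', 'active', 'inactive', 'active', 'inactive']
})

# Sort by time, then keep the last (most recent) record for each user
latest_per_user = (events.sort_values('timestamp')
                         .drop_duplicates(subset='user_id', keep='last')
                         .reset_index(drop=True))
print(latest_per_user)
This keeps exactly one row per user_id, the most recent one, and discards the earlier duplicates.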