Now that we understand what duplicate data is and why it's problematic, let's put theory into practice. In this section, we'll work through a hands-on example using Python and the popular pandas
library to find and remove duplicate entries from a sample dataset. This process is a standard step in preparing data for reliable analysis or model training.
First, imagine we have loaded our data into a pandas DataFrame. Let's create a small example DataFrame representing customer orders to illustrate the concepts:
import pandas as pd
# Sample data with duplicates
data = {
'order_id': [1, 2, 3, 1, 4, 5, 1, 7],
'customer_id': [10, 20, 10, 10, 30, 20, 10, 40],
'item': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple', 'Banana', 'Apple', 'Grape'],
'quantity': [5, 10, 3, 5, 2, 10, 5, 20],
'status': ['Shipped', 'Pending', 'Shipped', 'Shipped', 'Delivered', 'Cancelled', 'Shipped', 'Pending']
}
orders_df = pd.DataFrame(data)
print("Original DataFrame:")
print(orders_df)
Running this would display our initial dataset:
Original DataFrame:
order_id customer_id item quantity status
0 1 10 Apple 5 Shipped # Original
1 2 20 Banana 10 Pending
2 3 10 Orange 3 Shipped
3 1 10 Apple 5 Shipped # Duplicate of row 0
4 4 30 Apple 2 Delivered
5 5 20 Banana 10 Cancelled # Same customer/item as row 1
6 1 10 Apple 5 Shipped # Duplicate of row 0
7 7 40 Grape 20 Pending
Notice that rows 0, 3, and 6 are identical across all columns. Row 5 shares the same customer_id and item as row 1, but has a different status.
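Before listing individual rows, a quick count gives a sense of the scale of the problem. The sketch below recreates the sample orders (with row 6 a full copy of row 0, as the outputs in this section assume) and sums the boolean mask returned by duplicated(), since each True counts as 1:

```python
import pandas as pd

# Recreate the sample orders; row 6 is a full copy of row 0
orders_df = pd.DataFrame({
    'order_id': [1, 2, 3, 1, 4, 5, 1, 7],
    'customer_id': [10, 20, 10, 10, 30, 20, 10, 40],
    'item': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple', 'Banana', 'Apple', 'Grape'],
    'quantity': [5, 10, 3, 5, 2, 10, 5, 20],
    'status': ['Shipped', 'Pending', 'Shipped', 'Shipped', 'Delivered',
               'Cancelled', 'Shipped', 'Pending'],
})

# Summing a boolean mask counts the True entries
n_full = orders_df.duplicated().sum()
print(f"Fully duplicated rows: {n_full}")  # 2 (rows 3 and 6)
```

This count is a cheap first check on any new dataset before deciding how to handle the duplicates.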
To find rows that are exact copies of earlier rows, pandas provides the duplicated() method. It returns a boolean Series in which True indicates that a row is a duplicate of an earlier one.
# Identify complete duplicates
duplicate_rows = orders_df.duplicated()
print("\nBoolean mask for duplicate rows:")
print(duplicate_rows)
Output:
Boolean mask for duplicate rows:
0 False
1 False
2 False
3 True # This row is identical to row 0
4 False
5 False
6 True # This row is identical to row 0
7 False
dtype: bool
By default, duplicated() marks all occurrences except the first as duplicates. We can use this boolean Series to filter our DataFrame and view only the duplicated rows:
# Show the actual duplicate rows
print("\nComplete duplicate rows:")
print(orders_df[duplicate_rows])
Output:
Complete duplicate rows:
order_id customer_id item quantity status
3 1 10 Apple 5 Shipped
6 1 10 Apple 5 Shipped
This confirms that rows with index 3 and 6 are indeed complete duplicates of an earlier row (row 0 in this case).
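duplicated() also accepts a keep argument. With keep=False, every member of a duplicate group is flagged, including the first occurrence, which is useful for reviewing all copies side by side before deciding which to keep. A minimal sketch, recreating the same sample data:

```python
import pandas as pd

# Recreate the sample orders; row 6 is a full copy of row 0
orders_df = pd.DataFrame({
    'order_id': [1, 2, 3, 1, 4, 5, 1, 7],
    'customer_id': [10, 20, 10, 10, 30, 20, 10, 40],
    'item': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple', 'Banana', 'Apple', 'Grape'],
    'quantity': [5, 10, 3, 5, 2, 10, 5, 20],
    'status': ['Shipped', 'Pending', 'Shipped', 'Shipped', 'Delivered',
               'Cancelled', 'Shipped', 'Pending'],
})

# keep=False flags every copy, including the first occurrence
all_copies = orders_df[orders_df.duplicated(keep=False)]
print(all_copies)  # rows 0, 3, and 6
```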
Sometimes, we define duplication based on a subset of columns. For example, maybe an order is considered a duplicate if the customer_id and item are the same, regardless of order_id or status. We can check this using the subset argument.
# Identify duplicates based on 'customer_id' and 'item'
partial_duplicates = orders_df.duplicated(subset=['customer_id', 'item'])
print("\nBoolean mask for partial duplicates (customer_id, item):")
print(partial_duplicates)
# Show rows considered duplicates based on the subset
print("\nRows considered duplicates based on 'customer_id' and 'item':")
print(orders_df[partial_duplicates])
Output:
Boolean mask for partial duplicates (customer_id, item):
0 False
1 False
2 False
3 True # Same customer_id=10, item='Apple' as row 0
4 False
5 True # Same customer_id=20, item='Banana' as row 1
6 True # Same customer_id=10, item='Apple' as row 0
7 False
dtype: bool
Rows considered duplicates based on 'customer_id' and 'item':
order_id customer_id item quantity status
3 1 10 Apple 5 Shipped
5 5 20 Banana 10 Cancelled
6 6 10 Apple 5 Shipped
Here, rows 3 and 6 are marked because they repeat the (10, 'Apple') combination from row 0. Row 5 is marked because it repeats the (20, 'Banana') combination from row 1.
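The subset and keep arguments combine: subset with keep=False flags every row in each repeated (customer_id, item) group, and sorting by those columns lines the groups up for side-by-side review. A sketch, recreating the same sample data (with row 6 a full copy of row 0):

```python
import pandas as pd

orders_df = pd.DataFrame({
    'order_id': [1, 2, 3, 1, 4, 5, 1, 7],
    'customer_id': [10, 20, 10, 10, 30, 20, 10, 40],
    'item': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple', 'Banana', 'Apple', 'Grape'],
    'quantity': [5, 10, 3, 5, 2, 10, 5, 20],
    'status': ['Shipped', 'Pending', 'Shipped', 'Shipped', 'Delivered',
               'Cancelled', 'Shipped', 'Pending'],
})

# Flag every member of each repeated (customer_id, item) group
mask = orders_df.duplicated(subset=['customer_id', 'item'], keep=False)
print(f"Rows in repeated groups: {mask.sum()}")  # 5

# Sort so members of the same group sit next to each other
print(orders_df[mask].sort_values(['customer_id', 'item']))
```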
Once identified, removing duplicates is straightforward using the drop_duplicates() method. To remove rows that are identical across all columns, we call drop_duplicates() without any arguments. By default, it keeps the first occurrence and removes subsequent duplicates.
# Remove complete duplicate rows, keeping the first occurrence
orders_df_no_full_duplicates = orders_df.drop_duplicates()
print("\nDataFrame after removing complete duplicates (keeping first):")
print(orders_df_no_full_duplicates)
Output:
DataFrame after removing complete duplicates (keeping first):
order_id customer_id item quantity status
0 1 10 Apple 5 Shipped
1 2 20 Banana 10 Pending
2 3 10 Orange 3 Shipped
4 4 30 Apple 2 Delivered
5 5 20 Banana 10 Cancelled
7 7 40 Grape 20 Pending
Rows 3 and 6 have been removed.
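Note that drop_duplicates() preserves the original index labels, so gaps (like the missing 3 and 6 above) remain. If you want a clean 0..n-1 index on the result, you can pass ignore_index=True. A sketch on the same sample data (with row 6 a full copy of row 0):

```python
import pandas as pd

orders_df = pd.DataFrame({
    'order_id': [1, 2, 3, 1, 4, 5, 1, 7],
    'customer_id': [10, 20, 10, 10, 30, 20, 10, 40],
    'item': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple', 'Banana', 'Apple', 'Grape'],
    'quantity': [5, 10, 3, 5, 2, 10, 5, 20],
    'status': ['Shipped', 'Pending', 'Shipped', 'Shipped', 'Delivered',
               'Cancelled', 'Shipped', 'Pending'],
})

# ignore_index=True relabels the deduplicated result 0..n-1
deduped = orders_df.drop_duplicates(ignore_index=True)
print(deduped.index.tolist())  # [0, 1, 2, 3, 4, 5]
```

Chaining .reset_index(drop=True) after drop_duplicates() achieves the same effect.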
You can change which row is kept using the keep argument:

- keep='first' (default): keep the first occurrence, drop the others.
- keep='last': keep the last occurrence, drop earlier ones.
- keep=False: drop all rows that have duplicates.

Example using keep='last':
# Remove complete duplicates, keeping the last occurrence
orders_df_keep_last = orders_df.drop_duplicates(keep='last')
print("\nDataFrame after removing complete duplicates (keeping last):")
print(orders_df_keep_last)
Output:
DataFrame after removing complete duplicates (keeping last):
order_id customer_id item quantity status
1 2 20 Banana 10 Pending
2 3 10 Orange 3 Shipped
4 4 30 Apple 2 Delivered
5 5 20 Banana 10 Cancelled
6 1 10 Apple 5 Shipped # Kept last full duplicate of row 0
7 7 40 Grape 20 Pending
Now, row 6 is kept instead of row 0, and rows 0 and 3 are dropped.
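With keep=False, drop_duplicates() discards every row that has a full duplicate, keeping none of the copies. A sketch on the same sample data (with row 6 a full copy of row 0), where all three identical rows are removed:

```python
import pandas as pd

orders_df = pd.DataFrame({
    'order_id': [1, 2, 3, 1, 4, 5, 1, 7],
    'customer_id': [10, 20, 10, 10, 30, 20, 10, 40],
    'item': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple', 'Banana', 'Apple', 'Grape'],
    'quantity': [5, 10, 3, 5, 2, 10, 5, 20],
    'status': ['Shipped', 'Pending', 'Shipped', 'Shipped', 'Delivered',
               'Cancelled', 'Shipped', 'Pending'],
})

# keep=False drops every copy of a duplicated row, not just the extras
no_copies = orders_df.drop_duplicates(keep=False)
print(no_copies)  # rows 1, 2, 4, 5, 7 remain
```

This is the right choice when duplicated records are suspect and you would rather investigate them separately than trust any single copy.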
We can also use subset and keep together with drop_duplicates(). Let's remove rows where the customer_id and item are duplicated, keeping only the first instance of each combination.
# Remove duplicates based on 'customer_id' and 'item', keeping the first
orders_df_no_partial_duplicates = orders_df.drop_duplicates(subset=['customer_id', 'item'], keep='first')
print("\nDataFrame after removing partial duplicates (customer_id, item - keeping first):")
print(orders_df_no_partial_duplicates)
Output:
DataFrame after removing partial duplicates (customer_id, item - keeping first):
order_id customer_id item quantity status
0 1 10 Apple 5 Shipped # First (10, Apple)
1 2 20 Banana 10 Pending # First (20, Banana)
2 3 10 Orange 3 Shipped
4 4 30 Apple 2 Delivered
7 7 40 Grape 20 Pending
Rows 3, 5, and 6 were removed because their (customer_id, item) pairs, (10, 'Apple') and (20, 'Banana'), had already appeared.
After removing duplicates, it's good practice to verify the result. You can check the dimensions of the new DataFrame or re-run the duplicated().sum() check.
# Verify no complete duplicates remain in orders_df_no_full_duplicates
remaining_duplicates = orders_df_no_full_duplicates.duplicated().sum()
print(f"\nNumber of remaining complete duplicates: {remaining_duplicates}")
# Verify no partial duplicates remain in orders_df_no_partial_duplicates
remaining_partial = orders_df_no_partial_duplicates.duplicated(subset=['customer_id', 'item']).sum()
print(f"Number of remaining partial duplicates (customer_id, item): {remaining_partial}")
Output:
Number of remaining complete duplicates: 0
Number of remaining partial duplicates (customer_id, item): 0
This confirms that the respective drop_duplicates() operations worked as expected.
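In a longer pipeline, this verification can be folded into a small guard that fails loudly if duplicates slip through. The helper below (assert_unique is a hypothetical name, not a pandas function) raises if any rows are duplicated, optionally on a subset of columns, and is shown on the same sample data (with row 6 a full copy of row 0):

```python
import pandas as pd

def assert_unique(df, subset=None):
    """Raise ValueError if df contains duplicate rows (optionally on a column subset)."""
    n = df.duplicated(subset=subset).sum()
    if n:
        raise ValueError(f"found {n} duplicate row(s)")
    return df

orders_df = pd.DataFrame({
    'order_id': [1, 2, 3, 1, 4, 5, 1, 7],
    'customer_id': [10, 20, 10, 10, 30, 20, 10, 40],
    'item': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple', 'Banana', 'Apple', 'Grape'],
    'quantity': [5, 10, 3, 5, 2, 10, 5, 20],
    'status': ['Shipped', 'Pending', 'Shipped', 'Shipped', 'Delivered',
               'Cancelled', 'Shipped', 'Pending'],
})

# Passes: complete duplicates were removed first
clean = assert_unique(orders_df.drop_duplicates())

# Raises: a (customer_id, item) pair still repeats in `clean`
try:
    assert_unique(clean, subset=['customer_id', 'item'])
except ValueError as e:
    print(e)  # found 1 duplicate row(s)
```

Returning the DataFrame from the guard lets it sit inline in a method chain without disturbing the pipeline.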
This practical exercise demonstrates how to effectively identify and manage duplicate data using standard tools. Deciding whether to check for full duplicates or use a subset depends entirely on the context of your data and the goals of your analysis. Removing duplicates ensures that each record contributes appropriately to your findings and prevents inflated counts or biased model results.
© 2025 ApX Machine Learning