Sometimes, a duplicate record isn't an exact copy of another row across every single column. Instead, duplication might be defined by having identical values in only some specific columns. Think about a dataset of customer orders. You might have multiple rows for the same customer (identified by a `customer_id`) because they placed different orders at different times. While the `order_id` or `order_date` may differ, if you're trying to count unique customers, these rows represent the same entity based on the `customer_id`.
Identifying these "partial" duplicates is essential for many analyses. If you're analyzing unique customer counts, website user sessions, or distinct product entries, you need to look for repetitions within the columns that define uniqueness for that specific context. Ignoring this can lead to inflated counts, skewed statistics, and faulty conclusions. For example, calculating the average order value per customer would be incorrect if you treated each order row as a separate customer.
The first step is to determine which column, or combination of columns, uniquely identifies the entity you care about. This decision depends heavily on your dataset and your analysis goals.
- A `user_id` or `email` column often serves as a unique identifier.
- A `product_sku` or `isbn` might identify unique products.
- A combination of `session_id` and `user_id` might define a unique session.
- You might need a combination of columns like `first_name`, `last_name`, and `date_of_birth` to establish uniqueness, especially if there's no single ID column.

Carefully consider what constitutes a unique record for your specific task.
Most data analysis libraries provide straightforward ways to check for duplicates based on a subset of columns. Let's look at how you might do this using the popular Python library, pandas.
Imagine you have a DataFrame like this:
| customer_id | name | signup_date | last_purchase |
|---|---|---|---|
| 101 | Alice | 2023-01-15 | 2023-05-20 |
| 102 | Bob | 2023-02-10 | 2023-06-11 |
| 101 | Alice | 2023-01-15 | 2023-07-01 |
| 103 | Charlie | 2023-03-05 | 2023-03-05 |
| 102 | Bob | 2023-02-10 | 2023-08-15 |
Here, rows 0 and 2 represent the same customer (ID 101), and rows 1 and 4 represent another customer (ID 102). If we are interested in unique customers, the `customer_id` column is our identifier.

We can use the `duplicated()` method in pandas, but this time we specify the `subset` parameter:
import pandas as pd
data = {'customer_id': [101, 102, 101, 103, 102],
'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'signup_date': ['2023-01-15', '2023-02-10', '2023-01-15', '2023-03-05', '2023-02-10'],
'last_purchase': ['2023-05-20', '2023-06-11', '2023-07-01', '2023-03-05', '2023-08-15']}
df = pd.DataFrame(data)
# Check for duplicates based on 'customer_id'
# keep='first' marks every occurrence after the first as a duplicate
duplicates_mask = df.duplicated(subset=['customer_id'], keep='first')
print(duplicates_mask)
This code will output a Series of boolean values:
0 False
1 False
2 True
3 False
4 True
dtype: bool
Here, `False` means the `customer_id` is seen for the first time (or is unique among the preceding rows), and `True` indicates that this `customer_id` has already appeared in an earlier row. Row 2 is marked `True` because `customer_id` 101 appeared in row 0. Row 4 is marked `True` because `customer_id` 102 appeared in row 1.
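Because the mask is just a boolean Series, one quick way to see how many rows are flagged as repeats is to sum it (each `True` counts as 1). A minimal sketch using the mask from above:

# Count the rows that repeat an earlier customer_id
num_repeats = duplicates_mask.sum()
print(num_repeats)  # 2 for this sample data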
You can use this boolean mask to filter your DataFrame and see the duplicate entries based on the specific columns:
# Show the rows marked as duplicates based on 'customer_id'
print(df[duplicates_mask])
This would display:
| customer_id | name | signup_date | last_purchase |
|---|---|---|---|
| 101 | Alice | 2023-01-15 | 2023-07-01 |
| 102 | Bob | 2023-02-10 | 2023-08-15 |
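If you want to inspect every row involved in a duplication, not just the later occurrences, you can pass `keep=False`, which marks all rows whose `customer_id` appears more than once. A short sketch with the same DataFrame:

# keep=False flags every occurrence of a repeated customer_id,
# not just the second and later ones
all_involved = df[df.duplicated(subset=['customer_id'], keep=False)]
print(all_involved)
# Rows 0, 1, 2, and 4 are shown; customer 103 (row 3) appears only once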
You can also specify multiple columns in the `subset` list if uniqueness is defined by a combination:
# Example: Check duplicates based on both 'name' AND 'signup_date'
duplicates_combo_mask = df.duplicated(subset=['name', 'signup_date'], keep='first')
print(duplicates_combo_mask)
# In our sample data, the result would be the same as checking only 'customer_id'
# because name/signup_date combinations are also unique per customer.
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# dtype: bool
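Our sample data doesn't show a difference, so here is a small, purely illustrative DataFrame (the names and dates are invented) where a single column is too coarse and the combination matters:

# Two different people happen to share the same name
people = pd.DataFrame({
    'name': ['Dana', 'Dana', 'Evan'],
    'signup_date': ['2023-04-01', '2023-09-12', '2023-05-02']
})

# Using 'name' alone wrongly flags row 1 as a duplicate
print(people.duplicated(subset=['name']))                 # False, True, False

# Adding 'signup_date' distinguishes the two Danas
print(people.duplicated(subset=['name', 'signup_date']))  # False, False, False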
Identifying duplicates based on specific columns often reveals a different picture of your data compared to looking at complete row duplicates or the raw row count. A simple visualization can highlight this. For instance, we can count the total rows versus the number of unique entries based on our chosen key column(s).
Comparison showing the total number of records versus the number of unique customer IDs found in the sample data.
This clearly shows that while we have 5 records in total, there are only 3 unique customers according to the `customer_id` column.
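To produce the numbers behind this comparison yourself, one option is to contrast the DataFrame's length with the number of distinct values in the key column using `nunique()`:

# Compare total rows with distinct customer_id values
total_records = len(df)
unique_customers = df['customer_id'].nunique()

print(f"Total records: {total_records}")        # 5
print(f"Unique customers: {unique_customers}")  # 3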
Identifying duplicates based on specific columns is a common and important task in data cleaning. It allows you to correctly count unique entities, avoid redundant processing, and ensure your analysis reflects the true underlying structure of your data. The next step, covered in the following section, is deciding how to handle these identified duplicates, which usually involves removing them strategically.