Sometimes, the exact same record appears multiple times in your dataset. This might happen because of data entry errors, system glitches during data collection, or issues when combining data from different sources. These identical entries are called complete duplicate rows. They are rows where the values in every single column match the values in another row perfectly.
Think of it like having two identical business cards for the same person in your collection. They provide the exact same information. In a dataset, these complete duplicates don't add new information but can inflate counts, skew statistical summaries (like averages or sums), and potentially bias algorithms trained on the data.
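To make this concrete, here is a small self-contained sketch (using made-up scores, not the dataset below) of how a single duplicated entry shifts an average:

```python
# Toy illustration: the scores and the duplicated value are invented.
scores = [8, 9, 7, 10]
with_duplicate = scores + [10]  # suppose the 10 was recorded twice by mistake

mean_clean = sum(scores) / len(scores)                 # 34 / 4 = 8.5
mean_dup = sum(with_duplicate) / len(with_duplicate)   # 44 / 5 = 8.8
print(mean_clean, mean_dup)
```

The duplicate pulls the mean from 8.5 to 8.8, even though no new feedback was actually given.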
Identifying these rows is typically the first step in handling duplicates because it's the most straightforward case. We are looking for perfect matches across the entire record.
Most data analysis tools provide functions to detect these identical rows. Let's illustrate this using Python with the popular pandas library, often used for data manipulation. Imagine we have a simple dataset of customer feedback scores:
| CustomerID | SessionID | FeedbackDate | Score | Product |
|---|---|---|---|---|
| C101 | S201 | 2023-10-26 | 8 | WidgetA |
| C102 | S202 | 2023-10-26 | 9 | WidgetB |
| C101 | S203 | 2023-10-27 | 7 | WidgetA |
| C102 | S202 | 2023-10-26 | 9 | WidgetB |
| C103 | S204 | 2023-10-27 | 10 | WidgetC |
| C101 | S201 | 2023-10-26 | 8 | WidgetA |
Looking closely, you can see that the second row (index 1: C102, S202, ...) is identical to the fourth row (index 3). Also, the first row (index 0: C101, S201, ...) is identical to the last row (index 5). These are complete duplicates.
If this data is loaded into a pandas DataFrame named `feedback_df`, we can use the `.duplicated()` method to identify them:
```python
# Import pandas if you haven't already
import pandas as pd

# Example DataFrame (replace with your actual data loading)
data = {
    'CustomerID': ['C101', 'C102', 'C101', 'C102', 'C103', 'C101'],
    'SessionID': ['S201', 'S202', 'S203', 'S202', 'S204', 'S201'],
    'FeedbackDate': ['2023-10-26', '2023-10-26', '2023-10-27', '2023-10-26', '2023-10-27', '2023-10-26'],
    'Score': [8, 9, 7, 9, 10, 8],
    'Product': ['WidgetA', 'WidgetB', 'WidgetA', 'WidgetB', 'WidgetC', 'WidgetA']
}
feedback_df = pd.DataFrame(data)

# Check for complete duplicate rows
# By default, it marks subsequent occurrences as True
is_duplicate = feedback_df.duplicated()

# Display the boolean result for each row
print(is_duplicate)
```
The `duplicated()` method returns a pandas Series of boolean values (`True` or `False`). The length of this Series matches the number of rows in the original DataFrame.
The output for our example would look like this:
```
0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool
```
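Beyond printing the flags, the boolean Series can be used directly as a mask to pull out the flagged rows, or summed to count them. A brief sketch, reusing the same example data:

```python
import pandas as pd

# Same illustrative data as in the example above.
data = {
    'CustomerID': ['C101', 'C102', 'C101', 'C102', 'C103', 'C101'],
    'SessionID': ['S201', 'S202', 'S203', 'S202', 'S204', 'S201'],
    'FeedbackDate': ['2023-10-26', '2023-10-26', '2023-10-27', '2023-10-26', '2023-10-27', '2023-10-26'],
    'Score': [8, 9, 7, 9, 10, 8],
    'Product': ['WidgetA', 'WidgetB', 'WidgetA', 'WidgetB', 'WidgetC', 'WidgetA']
}
feedback_df = pd.DataFrame(data)

# Boolean indexing: keep only the rows flagged as duplicates
duplicate_rows = feedback_df[feedback_df.duplicated()]
print(duplicate_rows)

# Counting: True is treated as 1, so .sum() gives the number of duplicates
print(feedback_df.duplicated().sum())  # 2
```

This lets you inspect the actual duplicated records (rows 3 and 5 here) before deciding what to do with them.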
Here's how to interpret this:

- `False`: the row is either unique within the dataset, or it is the first occurrence of a set of duplicate rows. By default, pandas treats the first instance it encounters as the "original" and does not flag it.
- `True`: the row is an exact copy of a row that appeared earlier in the DataFrame. In our example, row 3 is marked `True` because it is identical to row 1, and row 5 is marked `True` because it is identical to row 0.

This boolean Series acts as a flag, pinpointing the rows that are redundant copies based on all columns matching. This identification is essential before proceeding to the next step, which usually involves deciding whether to remove these flagged duplicates to clean the dataset. We'll cover the removal process in the section "Removing Duplicate Rows".
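The "first occurrence is the original" behavior is controlled by the `keep` parameter of `duplicated()`: `keep='last'` flags the earlier copies instead, and `keep=False` flags every member of a duplicate group. A short sketch with the same example data:

```python
import pandas as pd

data = {
    'CustomerID': ['C101', 'C102', 'C101', 'C102', 'C103', 'C101'],
    'SessionID': ['S201', 'S202', 'S203', 'S202', 'S204', 'S201'],
    'FeedbackDate': ['2023-10-26', '2023-10-26', '2023-10-27', '2023-10-26', '2023-10-27', '2023-10-26'],
    'Score': [8, 9, 7, 9, 10, 8],
    'Product': ['WidgetA', 'WidgetB', 'WidgetA', 'WidgetB', 'WidgetC', 'WidgetA']
}
feedback_df = pd.DataFrame(data)

# keep='last': rows 0 and 1 are now flagged, their later copies are not
print(feedback_df.duplicated(keep='last').tolist())
# [True, True, False, False, False, False]

# keep=False: all four rows involved in duplication are flagged
print(feedback_df.duplicated(keep=False).tolist())
# [True, True, False, True, False, True]
```

`keep=False` is useful when you want to see every record involved in a duplicate group rather than only the redundant copies.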
© 2025 ApX Machine Learning