Alright, let's put theory into practice. In the previous sections, we discussed why data cleaning is essential for reliable processing and analysis. Raw data often arrives messy, incomplete, or inconsistent. Now, we'll walk through a simple example to see how basic cleaning operations can be applied using Python, a common language in data engineering, along with the Pandas library, which is excellent for data manipulation.
Don't worry if you haven't used Python or Pandas extensively before. We'll keep the code straightforward and explain each step. The goal here is to understand the process of cleaning, not to become a Python expert overnight.
Imagine we have a small list of customer contacts collected from different sources. As often happens, the data isn't perfectly uniform. Our task is to clean it up slightly.
Here's our sample raw data:
| Name | Email | Phone | Joined Date | City |
|---|---|---|---|---|
| Alice Smith | alice.smith@example.com | 555-123-4567 | 2023-01-15 | New York |
| Bob Johnson | BOB.J@EXAMPLE.COM | 555-987-6543 | 2023/02/20 | Los Angeles |
| Charlie Brown | charlie@sample.org | (555) 555-1212 | Mar 05, 2023 | |
| Diana Prince | diana.prince@example.net | nan | 2023-04-10 | San Francisco |
| Eve Adams | eve.adams@sample.net | 555.888.9999 | 2023/05/18 | new york |
Even in this tiny dataset, we can spot a few issues:

- Missing values: Diana Prince's phone is `nan` (which often signifies missing data), and Charlie Brown is missing a City.
- Inconsistent email capitalization: `BOB.J@EXAMPLE.COM` is uppercase while the other addresses are lowercase.
- Phone numbers use different separators: `-`, `()`, spaces, and `.`.
- Dates have different formats: `YYYY-MM-DD`, `YYYY/MM/DD`, and `Mon DD, YYYY`.
- City names have inconsistent capitalization ('New York' vs 'new york').

Let's write a simple script to address some of these.
First, you'd typically load this data into a structure that makes manipulation easy. In Python, we use a Pandas DataFrame. Imagine we've loaded the table above into a DataFrame called `df`.
```python
import pandas as pd

# You would normally load this from a file (like a CSV):
# df = pd.read_csv('your_data.csv')  # Or another source

# Here we build the same table in code so the example is runnable as-is
data = {
    'Name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'Diana Prince', 'Eve Adams'],
    'Email': ['alice.smith@example.com', 'BOB.J@EXAMPLE.COM', 'charlie@sample.org',
              'diana.prince@example.net', 'eve.adams@sample.net'],
    'Phone': ['555-123-4567', '555-987-6543', '(555) 555-1212', None, '555.888.9999'],  # None marks missing
    'Joined Date': ['2023-01-15', '2023/02/20', 'Mar 05, 2023', '2023-04-10', '2023/05/18'],
    'City': ['New York', 'Los Angeles', None, 'San Francisco', 'new york'],  # None marks missing
}

df = pd.DataFrame(data)

# print("Original Data:")
# print(df)
```
Inconsistent capitalization in emails can cause issues if you're trying to find unique users. A simple fix is to convert them all to lowercase.
```python
# Convert the 'Email' column to lowercase
df['Email'] = df['Email'].str.lower()

# print("\nData after standardizing email case:")
# print(df['Email'])
```
Now, `BOB.J@EXAMPLE.COM` becomes `bob.j@example.com`.
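Why does this matter? Here is a quick sketch of a uniqueness check; had the case not been normalized, the same user could be counted twice:

```python
# With consistent casing, duplicate detection behaves as expected:
# 'BOB.J@EXAMPLE.COM' and 'bob.j@example.com' now compare equal
unique_count = df['Email'].nunique()             # number of distinct addresses
has_duplicates = df['Email'].duplicated().any()  # True if any address repeats
```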
We see Diana Prince is missing a phone number. How we handle this depends on the requirements. We could drop the row, try to find the missing number, or fill it with a placeholder. For simplicity here, let's fill it with a standard placeholder like 'Not Provided'.
```python
# Fill missing values (None or NaN) in the 'Phone' column
df['Phone'] = df['Phone'].fillna('Not Provided')

# print("\nData after handling missing phone numbers:")
# print(df['Phone'])
```
Diana's phone number entry will now be 'Not Provided'.
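If the requirements instead called for discarding incomplete records, a one-liner would do it. This sketch shows the alternative we chose not to take:

```python
# Alternative: drop any row with a missing phone number
# (run this instead of fillna, not after it)
df_complete = df.dropna(subset=['Phone'])
```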
Similar to emails, city names have inconsistent capitalization. Let's standardize them to title case (e.g., 'New York'). We also need to handle the missing city for Charlie Brown. Let's fill it with 'Unknown'.
```python
# Fill missing city names first
df['City'] = df['City'].fillna('Unknown')

# Convert the 'City' column to title case
df['City'] = df['City'].str.title()

# print("\nData after standardizing city names:")
# print(df['City'])
```
Now 'new york' becomes 'New York', and Charlie's missing city becomes 'Unknown'.
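Note that title-casing only fixes capitalization; it won't unify genuine aliases. If a variant like 'NYC' appeared in the data (it doesn't in our sample), a small mapping could normalize it. The alias dictionary below is a hypothetical example, not part of our dataset:

```python
# Hypothetical alias map -- 'NYC' becomes 'Nyc' after .str.title(),
# so we map the title-cased variants to the canonical names
city_aliases = {'Nyc': 'New York', 'La': 'Los Angeles'}
df['City'] = df['City'].replace(city_aliases)
```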
Phone number formatting is tricky. A full solution involves pattern matching with regular expressions. For this basic exercise, let's just remove common separators: hyphens (`-`), parentheses (`(` and `)`), periods (`.`), and spaces. This isn't perfect, but it's a start towards standardization.
```python
# Remove common separators from phone numbers using chained .str.replace() calls.
# We skip the 'Not Provided' placeholder: stripping spaces would
# otherwise mangle it into 'NotProvided'.
mask = df['Phone'] != 'Not Provided'
df.loc[mask, 'Phone'] = (df.loc[mask, 'Phone']
                         .str.replace('-', '', regex=False)
                         .str.replace('(', '', regex=False)
                         .str.replace(')', '', regex=False)
                         .str.replace('.', '', regex=False)
                         .str.replace(' ', '', regex=False))

# print("\nData after standardizing phone format:")
# print(df['Phone'])
```
Now `555-123-4567` becomes `5551234567`, `(555) 555-1212` becomes `5555551212`, and so on. 'Not Provided' remains unchanged.
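For reference, the pattern matching mentioned above can collapse the whole chain into one call. A minimal sketch using a regular expression, reusing the same mask so the placeholder is left alone:

```python
# \D matches any non-digit character, so this keeps digits only
df.loc[mask, 'Phone'] = df.loc[mask, 'Phone'].str.replace(r'\D', '', regex=True)
```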
Date formats are also diverse. Pandas has powerful tools for this. We can try to automatically infer and convert the 'Joined Date' column to a standard datetime format.
```python
# Convert 'Joined Date' to datetime objects.
# format='mixed' (pandas 2.0+) parses each value independently;
# errors='coerce' turns unparseable dates into NaT (Not a Time).
df['Joined Date'] = pd.to_datetime(df['Joined Date'], format='mixed', errors='coerce')

# Optional: convert back to a standard string format like YYYY-MM-DD
df['Joined Date'] = df['Joined Date'].dt.strftime('%Y-%m-%d')

# print("\nData after standardizing date format:")
# print(df['Joined Date'])
```
All dates should now be in the `YYYY-MM-DD` format. If a date couldn't be parsed, it might appear as `NaT` or `None` depending on subsequent handling, which signals another potential data quality issue to investigate.
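To investigate, you can list the rows where parsing failed. A quick sketch:

```python
# After strftime, failed parses show up as missing values
unparsed = df[df['Joined Date'].isna()]
# print(unparsed)  # empty if every date parsed cleanly
```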
After running these steps, our DataFrame `df` would look something like this:
| Name | Email | Phone | Joined Date | City |
|---|---|---|---|---|
| Alice Smith | alice.smith@example.com | 5551234567 | 2023-01-15 | New York |
| Bob Johnson | bob.j@example.com | 5559876543 | 2023-02-20 | Los Angeles |
| Charlie Brown | charlie@sample.org | 5555551212 | 2023-03-05 | Unknown |
| Diana Prince | diana.prince@example.net | Not Provided | 2023-04-10 | San Francisco |
| Eve Adams | eve.adams@sample.net | 5558889999 | 2023-05-18 | New York |
To summarize the cleaning actions we performed: raw, messy data flows through several cleaning steps (standardizing case, filling missing values, unifying formats) to arrive at a more consistent dataset ready for processing.
This exercise demonstrates just a few basic data cleaning techniques. Real-world data often requires more sophisticated methods, including outlier detection, more complex validation rules, and potentially using external data sources for enrichment or correction. However, the fundamental principles remain the same: identify inconsistencies and errors, and apply transformations to improve data quality, making it suitable for reliable batch or stream processing.
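As a small taste of such validation rules, here is a minimal sketch that flags cleaned phone numbers not matching a ten-digit pattern; the ten-digit rule is an assumption suited to US-style numbers:

```python
# Flag phone numbers that are not exactly 10 digits
# (the 'Not Provided' placeholder is flagged as well)
invalid_phone = ~df['Phone'].str.fullmatch(r'\d{10}')
# print(df[invalid_phone])
```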