Now that we've discussed the why and what of data transformation (cleaning, structuring, enriching, and standardizing), let's put these ideas into practice. This section provides hands-on exercises using a sample dataset to solidify your understanding of common transformation techniques. The focus is on applying the logic you've learned; the short pandas sketches included along the way are optional illustrations of how each step might look, not the point of the exercise.
Imagine you've extracted the following customer data from a legacy system. The goal is to clean and standardize this data before loading it into a new marketing database.
Here's our raw data:
CustID | FullName | RegistrationDate | Email | State | Orders
---|---|---|---|---|---
101 | Alice Smith | 2022-05-15 | alice.s@mail.com | CA | 5
102 | Bob Johnson | 11/20/21 | bob | ny | 3
103 | | 2023-01-10 | charlie@mail.com | Texas | -2
104 | Diana Lee | March 3, 2022 | diana.lee@mail | ca | 10
105 | Evan Green | 2022-08-01 | | FL | 7
106 | Fiona Adams | 07/10/23 | fiona.a@mail.com | Nevada | 0
107 | George Miller | 2021-12-25 | george@mail | CA | none
This data has several issues typical of raw extracts: missing values, inconsistent date formats, inconsistent state representations, invalid email formats, and potentially incorrect order counts. Our task is to apply transformations to fix these problems.
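To keep the walkthrough concrete, the optional sketches in this section use pandas; the choice of library and the DataFrame name `df` are assumptions of convenience, not requirements of the exercise. The raw extract can be represented as a small DataFrame, and a quick profiling pass confirms the issues listed above:

```python
import pandas as pd

# The raw extract, represented as a pandas DataFrame for illustration.
# Orders arrives as text because of the 'none' value.
df = pd.DataFrame({
    "CustID": [101, 102, 103, 104, 105, 106, 107],
    "FullName": ["Alice Smith", "Bob Johnson", None, "Diana Lee",
                 "Evan Green", "Fiona Adams", "George Miller"],
    "RegistrationDate": ["2022-05-15", "11/20/21", "2023-01-10",
                         "March 3, 2022", "2022-08-01", "07/10/23", "2021-12-25"],
    "Email": ["alice.s@mail.com", "bob", "charlie@mail.com",
              "diana.lee@mail", None, "fiona.a@mail.com", "george@mail"],
    "State": ["CA", "ny", "Texas", "ca", "FL", "Nevada", "CA"],
    "Orders": ["5", "3", "-2", "10", "7", "0", "none"],
})

# Quick profiling: count missing values and inspect the Orders column.
print(df.isna().sum())          # FullName and Email each have one missing value
print(df["Orders"].unique())    # includes '-2' and 'none'
```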
First, let's address the most obvious errors and missing information.

- **Handle missing values:** CustID 103 has no `FullName` and CustID 105 has no `Email`. How would you handle these? A common approach is to replace them with a default value like 'Unknown' or `NULL`, depending on the target system's requirements. Let's decide to use 'Unknown' for `FullName` and leave `Email` as potentially empty if it's truly missing (but we'll validate formats later).
- **Correct invalid values:** Look at the `Orders` column. CustID 103 has -2 orders, which is likely an error. CustID 107 has 'none'. How should these be handled? We might decide that negative orders are invalid and should be set to 0 or flagged for investigation. Text values like 'none' should definitely be converted to a numerical representation, likely 0.

**Applying the Cleaning:**

- CustID 103: Set `FullName` to 'Unknown'. Set `Orders` to 0.
- CustID 105: Set `Email` to an empty string or `NULL` (let's represent it as empty for now).
- CustID 107: Set `Orders` to 0.
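Continuing with the hypothetical `df`, these cleaning rules might be sketched as:

```python
import pandas as pd

# Replace the missing FullName with the default 'Unknown'.
df["FullName"] = df["FullName"].fillna("Unknown")

# Convert Orders to numbers: text like 'none' becomes NaN, which we replace
# with 0, and negative counts are clamped to 0.
df["Orders"] = (
    pd.to_numeric(df["Orders"], errors="coerce")
    .fillna(0)
    .clip(lower=0)
    .astype(int)
)

# Represent the truly missing Email as an empty string for now;
# format validation happens in the next stage.
df["Email"] = df["Email"].fillna("")
```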
Our data now looks like this:
CustID | FullName | RegistrationDate | Email | State | Orders
---|---|---|---|---|---
101 | Alice Smith | 2022-05-15 | alice.s@mail.com | CA | 5
102 | Bob Johnson | 11/20/21 | bob | ny | 3
103 | Unknown | 2023-01-10 | charlie@mail.com | Texas | 0
104 | Diana Lee | March 3, 2022 | diana.lee@mail | ca | 10
105 | Evan Green | 2022-08-01 | | FL | 7
106 | Fiona Adams | 07/10/23 | fiona.a@mail.com | Nevada | 0
107 | George Miller | 2021-12-25 | george@mail | CA | 0
Next, let's enforce consistency in formats and validate certain fields.

- **Validate emails:** Look at the `Email` column. CustID 102 ('bob') and CustID 104/107 ('diana.lee@mail', 'george@mail') seem incomplete or invalid. A simple validation rule might check for the presence of an '@' symbol and a period '.' after the '@'. Emails failing this check could be flagged or set to a default invalid marker. Let's assume we flag them for review or set them to empty/`NULL` if they fail basic validation.
- **Standardize dates:** The `RegistrationDate` column uses multiple formats ('YYYY-MM-DD', 'MM/DD/YY', 'Month Day, YYYY'). We need to standardize this, perhaps to the 'YYYY-MM-DD' format.
- **Standardize states:** The `State` column has 'CA', 'ny', 'Texas', 'ca', 'FL', 'Nevada'. We need to standardize this to a consistent format, like the 2-letter uppercase abbreviation. We'll need a mapping (lookup) for full names or lowercase versions.
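A sketch of these three rules on the hypothetical `df`; the regular expression and the tiny state lookup are illustrative stand-ins (a production pipeline would use a vetted email validator and a complete state table):

```python
import re
import pandas as pd

# Basic email check: something before '@', and a '.' somewhere after it.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def basic_email_check(value: str) -> str:
    """Keep the email if it passes the basic check, otherwise blank it for review."""
    return value if EMAIL_PATTERN.match(value) else ""

df["Email"] = df["Email"].apply(basic_email_check)

# Standardize dates to 'YYYY-MM-DD'. format='mixed' (pandas 2.0+) parses each
# value's format individually; month-first is assumed for ambiguous dates
# such as '07/10/23'.
df["RegistrationDate"] = pd.to_datetime(
    df["RegistrationDate"], format="mixed"
).dt.strftime("%Y-%m-%d")

# Uppercase everything, then map full state names to 2-letter codes.
# Only the states in this sample are listed; a real lookup would cover all 50.
STATE_LOOKUP = {"TEXAS": "TX", "NEVADA": "NV"}
df["State"] = df["State"].str.upper().replace(STATE_LOOKUP)
```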
**Applying Validation and Standardization:**

- CustID 102, 104, 107: Clear the invalid emails (flagged for review).
- Convert all registration dates to 'YYYY-MM-DD'.
- Map all state values to 2-letter uppercase abbreviations.

The data after this stage:
CustID | FullName | RegistrationDate | Email | State | Orders
---|---|---|---|---|---
101 | Alice Smith | 2022-05-15 | alice.s@mail.com | CA | 5
102 | Bob Johnson | 2021-11-20 | | NY | 3
103 | Unknown | 2023-01-10 | charlie@mail.com | TX | 0
104 | Diana Lee | 2022-03-03 | | CA | 10
105 | Evan Green | 2022-08-01 | | FL | 7
106 | Fiona Adams | 2023-07-10 | fiona.a@mail.com | NV | 0
107 | George Miller | 2021-12-25 | | CA | 0
Finally, let's restructure the data slightly to better fit our target system's schema. Assume the target database requires separate `FirstName` and `LastName` fields instead of `FullName`.

- **Split columns:** Split the `FullName` column into `FirstName` and `LastName`. We can split the string at the first space. For 'Alice Smith', `FirstName` is 'Alice' and `LastName` is 'Smith'. What about 'Unknown'? We can set both `FirstName` and `LastName` to 'Unknown'.
- **Derive new data (enrichment):** Extract the year from `RegistrationDate` and create a new `RegistrationYear` column.
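These two steps could be sketched as follows, again on the hypothetical `df` (splitting at the first space is the simple rule chosen above; multi-part names would need a more careful policy):

```python
import pandas as pd

# Split FullName at the first space into FirstName and LastName.
# 'Unknown' has no space, so its missing LastName is filled with the
# FirstName, giving 'Unknown' for both, as decided above.
name_parts = df["FullName"].str.split(" ", n=1, expand=True)
df["FirstName"] = name_parts[0]
df["LastName"] = name_parts[1].fillna(name_parts[0])

# Enrichment: derive RegistrationYear from the standardized date.
df["RegistrationYear"] = pd.to_datetime(df["RegistrationDate"]).dt.year

# Drop FullName now that it has been split, and order columns for loading.
df = df[["CustID", "FirstName", "LastName", "RegistrationDate",
         "RegistrationYear", "Email", "State", "Orders"]]
```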
**Applying Structuring and Enrichment:**

- Split `FullName` into `FirstName` and `LastName`.
- Derive `RegistrationYear` from `RegistrationDate`.

The final transformed data, ready for loading:
CustID | FirstName | LastName | RegistrationDate | RegistrationYear | Email | State | Orders
---|---|---|---|---|---|---|---
101 | Alice | Smith | 2022-05-15 | 2022 | alice.s@mail.com | CA | 5
102 | Bob | Johnson | 2021-11-20 | 2021 | | NY | 3
103 | Unknown | Unknown | 2023-01-10 | 2023 | charlie@mail.com | TX | 0
104 | Diana | Lee | 2022-03-03 | 2022 | | CA | 10
105 | Evan | Green | 2022-08-01 | 2022 | | FL | 7
106 | Fiona | Adams | 2023-07-10 | 2023 | fiona.a@mail.com | NV | 0
107 | George | Miller | 2021-12-25 | 2021 | | CA | 0
Through these activities, we simulated several fundamental transformation steps:

- Handling missing values by substituting defaults.
- Correcting invalid values (negative or non-numeric order counts).
- Validating field formats (emails).
- Standardizing inconsistent representations (dates, state codes).
- Restructuring columns to match the target schema (splitting `FullName`).
- Deriving new information (`RegistrationYear`) from existing data.

This transformed dataset is now much more consistent, reliable, and structured appropriately for loading into the target marketing database. While this example used simple rules and a small dataset, the principles apply to larger, more complex transformation tasks encountered in real-world ETL processes. The key is to understand the target requirements and apply the necessary cleaning, validation, standardization, structuring, and enrichment steps systematically.