Hands-on exercises apply principles of data transformation, including cleaning, structuring, enriching, and standardizing. Practice with a sample dataset solidifies understanding of common transformation techniques. Code writing is not the focus; instead, the emphasis is on applying transformation logic.
Imagine you've extracted the following customer data from a legacy system. The goal is to clean and standardize this data before loading it into a new marketing database.
Here's our raw data:
| CustID | FullName | RegistrationDate | State | Orders | |
|---|---|---|---|---|---|
| 101 | Alice Smith | 2022-05-15 | [email protected] | CA | 5 |
| 102 | Bob Johnson | 11/20/21 | bob | ny | 3 |
| 103 | 2023-01-10 | [email protected] | Texas | -2 | |
| 104 | Diana Lee | March 3, 2022 | diana.lee@mail | ca | 10 |
| 105 | Evan Green | 2022-08-01 | FL | 7 | |
| 106 | Fiona Adams | 07/10/23 | [email protected] | Nevada | 0 |
| 107 | George Miller | 2021-12-25 | george@mail | CA | none |
This data has several issues typical of raw extracts: missing values, inconsistent date formats, inconsistent state representations, invalid email formats, and potentially incorrect order counts. Our task is to apply transformations to fix these problems.
First, let's address the most obvious errors and missing information.
FullName and Email. How would you handle these? A common approach is to replace them with a default value like 'Unknown' or NULL, depending on the target system's requirements. Let's decide to use 'Unknown' for FullName and leave Email as potentially empty if it's truly missing (but we'll validate formats later).Orders column. CustID 103 has -2 orders, which is likely an error. CustID 107 has 'none'. How should these be handled? We might decide that negative orders are invalid and should be set to 0 or flagged for investigation. Text values like 'none' should definitely be converted to a numerical representation, likely 0.Applying the Cleaning:
FullName to 'Unknown'. Set Orders to 0.Email to an empty string or NULL (let's represent as empty for now).Orders to 0.Our data now looks like this:
| CustID | FullName | RegistrationDate | State | Orders | |
|---|---|---|---|---|---|
| 101 | Alice Smith | 2022-05-15 | [email protected] | CA | 5 |
| 102 | Bob Johnson | 11/20/21 | bob | ny | 3 |
| 103 | Unknown | 2023-01-10 | [email protected] | Texas | 0 |
| 104 | Diana Lee | March 3, 2022 | diana.lee@mail | ca | 10 |
| 105 | Evan Green | 2022-08-01 | FL | 7 | |
| 106 | Fiona Adams | 07/10/23 | [email protected] | Nevada | 0 |
| 107 | George Miller | 2021-12-25 | george@mail | CA | 0 |
Next, let's enforce consistency in formats and validate certain fields.
Email column. CustID 102 ('bob') and CustID 104/107 ('diana.lee@mail', 'george@mail') seem incomplete or invalid. A simple validation rule might check for the presence of an '@' symbol and a period '.' after the '@'. Emails failing this check could be flagged or set to a default invalid marker. Let's assume we flag them for review or set them to empty/NULL if they fail basic validation.RegistrationDate column uses multiple formats ('YYYY-MM-DD', 'MM/DD/YY', 'Month Day, YYYY'). We need to standardize this, perhaps to the 'YYYY-MM-DD' format.
State column has 'CA', 'ny', 'Texas', 'ca', 'FL', 'Nevada'. We need to standardize this to a consistent format, like the 2-letter uppercase abbreviation. We'll need a mapping (lookup) for full names or lowercase versions.
Applying Validation and Standardization:
The data after this stage:
| CustID | FullName | RegistrationDate | State | Orders | |
|---|---|---|---|---|---|
| 101 | Alice Smith | 2022-05-15 | [email protected] | CA | 5 |
| 102 | Bob Johnson | 2021-11-20 | NY | 3 | |
| 103 | Unknown | 2023-01-10 | [email protected] | TX | 0 |
| 104 | Diana Lee | 2022-03-03 | CA | 10 | |
| 105 | Evan Green | 2022-08-01 | FL | 7 | |
| 106 | Fiona Adams | 2023-07-10 | [email protected] | NV | 0 |
| 107 | George Miller | 2021-12-25 | CA | 0 |
Finally, let's restructure the data slightly to better fit our target system's schema. Assume the target database requires separate FirstName and LastName fields instead of FullName.
FullName column into FirstName and LastName. We can split the string at the first space. For 'Alice Smith', FirstName is 'Alice' and LastName is 'Smith'. What about 'Unknown'? We can set both FirstName and LastName to 'Unknown'.RegistrationDate and create a new RegistrationYear column.Applying Structuring and Enrichment:
FullName into FirstName and LastName.RegistrationYear from RegistrationDate.The final transformed data, ready for loading:
| CustID | FirstName | LastName | RegistrationDate | RegistrationYear | State | Orders | |
|---|---|---|---|---|---|---|---|
| 101 | Alice | Smith | 2022-05-15 | 2022 | [email protected] | CA | 5 |
| 102 | Bob | Johnson | 2021-11-20 | 2021 | NY | 3 | |
| 103 | Unknown | Unknown | 2023-01-10 | 2023 | [email protected] | TX | 0 |
| 104 | Diana | Lee | 2022-03-03 | 2022 | CA | 10 | |
| 105 | Evan | Green | 2022-08-01 | 2022 | FL | 7 | |
| 106 | Fiona | Adams | 2023-07-10 | 2023 | [email protected] | NV | 0 |
| 107 | George | Miller | 2021-12-25 | 2021 | CA | 0 |
Through these activities, we simulated several fundamental transformation steps:
RegistrationYear) from existing data."This transformed dataset is now much more consistent, reliable, and structured appropriately for loading into the target marketing database. While this example used simple rules and a small dataset, the principles apply to larger, more complex transformation tasks encountered in ETL processes. The key is to understand the target requirements and apply the necessary cleaning, validation, standardization, structuring, and enrichment steps systematically."
Was this section helpful?
© 2026 ApX Machine LearningEngineered with