Okay, let's enhance our data. After cleaning and standardizing, the transformation stage often involves making the data more informative than it was originally. This process is called Data Enrichment. Think of it as adding context or calculating new insights directly into your dataset before it moves further down the pipeline. While cleaning focuses on fixing what's wrong, enrichment focuses on adding what's missing or could be valuable.
Data Enrichment is the process of enhancing, refining, or otherwise improving raw data by appending related information from other sources or deriving new data points from existing ones. The goal is to make the data more useful and insightful for analysis or for the target application. Instead of just having isolated pieces of information, enrichment helps create a more complete picture.
Let's look at a few common ways data is enriched during the transformation stage.
One of the most straightforward enrichment techniques is creating new fields based on calculations performed on existing data within the same record.
Example: Imagine you have extracted sales order data with quantity and unit_price columns. This data is useful, but you might frequently need the total_price for each order line. Instead of calculating this every time you query the data later, you can add it during transformation.
So, if a record has quantity = 5 and unit_price = 10.00, the enrichment process adds a new field total_price with the value 50.00.
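The total_price calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name add_calculated_fields and the dictionary record layout are hypothetical, and Decimal is used here only as one reasonable way to avoid floating-point rounding on currency values.

```python
from decimal import Decimal


def add_calculated_fields(record: dict) -> dict:
    """Return a copy of an order line with a derived total_price field.

    The function name and record layout are illustrative assumptions.
    """
    enriched = dict(record)  # copy so the extracted record is left untouched
    # Compute the total once during transformation instead of at query time.
    enriched["total_price"] = record["quantity"] * record["unit_price"]
    return enriched


order_line = {"quantity": 5, "unit_price": Decimal("10.00")}
enriched_line = add_calculated_fields(order_line)
print(enriched_line["total_price"])  # 50.00
```

Returning a copy rather than modifying the input keeps the extracted record available for debugging or reprocessing, which is a design choice rather than a requirement.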
Other examples include:
Combining first_name and last_name into a full_name field.
Often, the data you extract lacks important context that exists elsewhere. Enrichment can involve looking up related information from external or internal reference datasets (like database tables, spreadsheets, or even simple files) and merging it into your main data flow.
Example: Your extracted sales data might contain a product_id, but not the product_name or category. You likely have a separate "Products" table or file that maps IDs to names and categories. The enrichment process can perform a lookup using the product_id from the sales data to find the corresponding product_name and category in the Products data and add them as new columns to the sales record.
Input sales record: order_id=101, product_id=P45, quantity=2
Products reference record: id=P45, name='Standard Widget', category='Widgets'
Enriched sales record: order_id=101, product_id=P45, product_name='Standard Widget', category='Widgets', quantity=2
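The product lookup above might look like the following sketch. It assumes the Products reference data has already been loaded into an in-memory dictionary keyed by product ID; the function name enrich_with_product is hypothetical, and the choice to fill unmatched IDs with None (rather than rejecting the record) is an assumption that a real pipeline would need to decide for itself.

```python
# Reference data keyed by product ID. This dictionary is a hypothetical
# stand-in for a Products table or file loaded earlier in the pipeline.
products = {
    "P45": {"name": "Standard Widget", "category": "Widgets"},
}


def enrich_with_product(sale: dict, reference: dict) -> dict:
    """Return a copy of a sales record with product_name and category added."""
    enriched = dict(sale)
    product = reference.get(sale["product_id"])  # None when the ID is unknown
    # Assumption: unmatched IDs are kept and filled with None, not dropped.
    enriched["product_name"] = product["name"] if product else None
    enriched["category"] = product["category"] if product else None
    return enriched


sale = {"order_id": 101, "product_id": "P45", "quantity": 2}
enriched_sale = enrich_with_product(sale, products)
print(enriched_sale)
```

Keying the reference data by ID makes each lookup a constant-time dictionary access, which tends to matter once the sales data is much larger than the reference data.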
Another common lookup involves using geographical codes (like zip codes or city names) to add region, state, or country information.
This diagram shows how extracted data flows into an enrichment process, which uses reference data (like product information) to produce enhanced output data containing additional fields.
Sometimes, you can derive new categorical attributes or boolean flags based on conditions applied to existing data. This helps in segmenting data or quickly identifying records of interest.
Example: Based on the total_price calculated earlier, you might want to categorize sales:
If total_price > 1000, add a new field order_value_segment with the value 'High'.
If total_price is between 100 and 1000, set order_value_segment to 'Medium'.
Another example could be adding a boolean flag is_international based on whether the customer's country field (perhaps added via a lookup) is different from the company's home country.
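The segmenting rule and the is_international flag can be sketched together as shown here. Several details are assumptions made for illustration: the 'Low' label for orders under 100 is not given in the text, the boundaries 100 and 1000 are treated as inclusive for 'Medium', the home country value "US" is a placeholder, and the function names are hypothetical.

```python
HOME_COUNTRY = "US"  # placeholder; substitute the company's actual home country


def segment_order(total_price: float) -> str:
    """Map a total price to an order value segment."""
    if total_price > 1000:
        return "High"
    if total_price >= 100:  # assumes "between 100 and 1000" is inclusive
        return "Medium"
    return "Low"  # assumed label; the text only defines 'High' and 'Medium'


def add_derived_attributes(record: dict, home_country: str = HOME_COUNTRY) -> dict:
    """Return a copy of the record with a segment and an international flag."""
    enriched = dict(record)
    enriched["order_value_segment"] = segment_order(record["total_price"])
    # True when the customer's country differs from the home country.
    enriched["is_international"] = record["country"] != home_country
    return enriched


order = {"order_id": 101, "total_price": 1500.0, "country": "DE"}
flagged_order = add_derived_attributes(order)
print(flagged_order["order_value_segment"], flagged_order["is_international"])
```

Keeping the thresholds in one small function means a change to the segment boundaries is made in a single place.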
Enriching data during the transformation stage offers several advantages:
By adding calculated fields, looking up related information, and deriving new attributes, data enrichment significantly increases the value and utility of your data, preparing it effectively for the final loading stage. It moves beyond simply cleaning the data to actively enhancing its potential for generating insights.
© 2025 ApX Machine Learning