Okay, let's put theory into practice. We've discussed how the Extraction stage works. Now, we'll simulate the process of extracting data from a common source: a simple Comma-Separated Values (CSV) file. This exercise will help solidify your understanding of how data is initially retrieved in an ETL workflow.
Imagine you work for an online retailer. Every day, a new file named orders.csv is generated containing the details of orders placed. Your task is to extract this data so it can be processed later.
Let's assume the orders.csv file for today looks like this:
OrderID,CustomerID,OrderDate,Amount,Status
101,CUST-A,2023-10-26,150.00,Shipped
102,CUST-B,2023-10-26,75.50,Processing
103,CUST-A,2023-10-27,210.25,Shipped
104,CUST-C,2023-10-28,30.00,Pending
This is structured data. Each line represents an order, and the values are separated by commas. The first line is the header, defining the fields (columns).
In a real ETL pipeline, specialized tools or scripts would establish a connection to the file system (or database, or API endpoint) where orders.csv resides. For this simulation, we'll assume we have successfully connected and can access the file's content.
As discussed earlier, a full extraction involves reading the entire dataset from the source. If we perform a full extraction on orders.csv, the raw data pulled into our ETL process would look something like this (represented here as a list of records):
[
{"OrderID": "101", "CustomerID": "CUST-A", "OrderDate": "2023-10-26", "Amount": "150.00", "Status": "Shipped"},
{"OrderID": "102", "CustomerID": "CUST-B", "OrderDate": "2023-10-26", "Amount": "75.50", "Status": "Processing"},
{"OrderID": "103", "CustomerID": "CUST-A", "OrderDate": "2023-10-27", "Amount": "210.25", "Status": "Shipped"},
{"OrderID": "104", "CustomerID": "CUST-C", "OrderDate": "2023-10-28", "Amount": "30.00", "Status": "Pending"}
]
Notice that the data is extracted exactly as it appears in the file, with every value, including the numbers and dates, represented as a string. The structure (field names) is derived from the header row. This is the raw material for the next stage.
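As a rough sketch of what this step might look like in code, here is one way to perform a full extraction with Python's built-in csv module. The extract_full helper name is just for illustration; a real pipeline would likely wrap this in its own connector logic:

```python
import csv

def extract_full(path):
    """Read every row of the CSV and return a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        # DictReader uses the first line as field names; every value comes back as a string
        return list(csv.DictReader(f))

orders = extract_full("orders.csv")
print(orders[0])  # {'OrderID': '101', 'CustomerID': 'CUST-A', 'OrderDate': '2023-10-26', ...}
```

Note that no type conversion happens here: amounts and dates stay as strings, exactly as shown in the extracted records above.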
Diagram: the full extraction process reads the entire source file (orders.csv) to produce the raw extracted data.
Now, let's imagine some time has passed, and a new order has been added to the orders.csv file:
OrderID,CustomerID,OrderDate,Amount,Status
101,CUST-A,2023-10-26,150.00,Shipped
102,CUST-B,2023-10-26,75.50,Processing
103,CUST-A,2023-10-27,210.25,Shipped
104,CUST-C,2023-10-28,30.00,Pending
105,CUST-B,2023-10-28,99.99,Processing <-- New Order
If we were using an incremental extraction strategy, the goal would be to extract only the new or changed data since the last extraction. Assuming our ETL process keeps track of the last OrderID extracted (which was 104), it would query or scan the source for orders with an ID greater than 104.
In this simulation, the incremental extraction would yield only the new record:
[
{"OrderID": "105", "CustomerID": "CUST-B", "OrderDate": "2023-10-28", "Amount": "99.99", "Status": "Processing"}
]
This is much more efficient than re-extracting the entire file if the source dataset is large and changes are relatively small. The logic for identifying changes (like using IDs, timestamps, or CDC methods) is central to incremental strategies.
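A minimal sketch of this ID-based approach, assuming the last extracted OrderID (104) is stored somewhere between runs; the extract_incremental helper is hypothetical:

```python
import csv

def extract_incremental(path, last_order_id):
    """Return only the rows whose OrderID is greater than the last one extracted."""
    with open(path, newline="") as f:
        # OrderID is read as a string, so convert it before comparing
        return [row for row in csv.DictReader(f)
                if int(row["OrderID"]) > last_order_id]

new_orders = extract_incremental("orders.csv", last_order_id=104)
print(new_orders)  # [{'OrderID': '105', 'CustomerID': 'CUST-B', ...}]
```

In practice, the "last extracted" marker (an ID, a timestamp, or a CDC log position) would be persisted by the pipeline itself, for example in a small metadata table, so the next run knows where to resume.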
What if the source file had an error? For example:
OrderID,CustomerID,OrderDate,Amount,Status
101,CUST-A,2023-10-26,150.00,Shipped
102,CUST-B,2023-10-26,,Processing <-- Missing Amount
103,CUST-A,2023-10-27,210.25,Shipped
During extraction, the process might encounter this missing Amount for OrderID 102. Depending on the ETL tool or script's configuration for handling errors, it might halt the run and report the problem, skip the record entirely, or extract the record as-is with the empty Amount. Often, the goal of extraction is just to get the data, warts and all. The cleaning and validation typically happen during the Transformation stage. So, a common outcome would be extracting the record with the missing value noted:
[
{"OrderID": "101", "CustomerID": "CUST-A", "OrderDate": "2023-10-26", "Amount": "150.00", "Status": "Shipped"},
{"OrderID": "102", "CustomerID": "CUST-B", "OrderDate": "2023-10-26", "Amount": "", "Status": "Processing"}, // Note the empty Amount
{"OrderID": "103", "CustomerID": "CUST-A", "OrderDate": "2023-10-27", "Amount": "210.25", "Status": "Shipped"}
]
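One way to implement this "extract everything, note the problem" behaviour is to log the issue while still passing the record downstream. A sketch, using a hypothetical extract_with_logging helper:

```python
import csv
import logging

def extract_with_logging(path):
    """Extract every record as-is, but log any row with an empty Amount for later review."""
    records = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Amount"] == "":
                # Don't drop or fix the row here; cleaning belongs to the Transformation stage
                logging.warning("OrderID %s has an empty Amount", row["OrderID"])
            records.append(row)
    return records
```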
This simulation walked through extracting data from a simple CSV file. We saw how a full extraction reads the entire dataset, how an incremental extraction retrieves only the records added since the last run, and how data quality issues are typically carried through extraction as-is and left for later stages to resolve.
You now have a more concrete picture of the Extraction process. It's about reliably retrieving the necessary raw data from its origin, preparing it for the subsequent steps in the ETL pipeline. Next, we'll explore what happens in the Transformation stage, where this raw data gets cleaned, reshaped, and enriched.