Designing a simple ETL pipeline brings together principles from the individual Extract, Transform, and Load stages, their connection into a workflow, and a basic understanding of tools, scheduling, and monitoring. Here, the task is to apply these concepts by designing a basic ETL pipeline. This exercise does not require writing code or using specific software; the primary goal is to think through the process and sketch out its structure, much like an architect creates blueprints before construction begins.
The goal of this hands-on activity is to design the workflow for a basic ETL pipeline based on a common business scenario. We will define the sources, transformations, target, and the sequence of operations. You can do this on paper, using a whiteboard, or with a simple diagramming tool.
Imagine you work for a small online retailer. Every day, you need to create a consolidated report of the previous day's sales. The sales data comes from two different places:
- A CSV file (online_sales.csv) generated daily. It contains order_id, customer_email, product_sku, quantity, sale_amount, and order_timestamp (YYYY-MM-DD HH:MM:SS format).
- A database table (kiosk_sales). It has columns transaction_id, customer_id, item_code, units_sold, total_price, and sale_date (MM/DD/YYYY format).

The goal is to create a single, clean dataset (perhaps another CSV file or a database table) named daily_consolidated_sales with the following columns: sale_id (a unique identifier for the sale), source ('Online' or 'Kiosk'), product_identifier, quantity_sold, revenue, and sale_datetime (standardized ISO 8601 format: YYYY-MM-DDTHH:MM:SS). We only want sales from the previous day.
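Even though the exercise stays at the whiteboard level, it can help to write the target schema down in a precise form. As one hypothetical sketch in Python (the class name and field types are assumptions for illustration), the daily_consolidated_sales record could be expressed as a dataclass whose fields mirror the columns just listed:

```python
from dataclasses import dataclass

# Hypothetical shape of one row in the daily_consolidated_sales target.
# Field names mirror the target columns; the types are assumptions.
@dataclass
class ConsolidatedSale:
    sale_id: str            # unique across both sources
    source: str             # 'Online' or 'Kiosk'
    product_identifier: str
    quantity_sold: int
    revenue: float
    sale_datetime: str      # ISO 8601, e.g. '2024-05-01T14:30:00'
```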
Let's break down what we need to achieve:
- Extract data from two sources: the online_sales.csv file and the kiosk_sales database table.
- Load the combined result into a single target (daily_consolidated_sales.csv or a database table).
- Map and rename columns to a common schema (product_sku and item_code to product_identifier; quantity and units_sold to quantity_sold; sale_amount and total_price to revenue).
- Standardize date formats (order_timestamp and sale_date to sale_datetime in ISO 8601).
- Add a source column indicating 'Online' or 'Kiosk'.
- Generate a unique sale_id for each record in the final dataset.
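The timestamp standardization is the fiddliest of these requirements, since the two sources use different formats. As a rough sketch (assuming a Python implementation and the two formats stated in the scenario), the conversion to ISO 8601 might look like this; the function names are illustrative:

```python
from datetime import datetime

def standardize_online_timestamp(raw: str) -> str:
    """Convert 'YYYY-MM-DD HH:MM:SS' (online source) to ISO 8601."""
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").isoformat()

def standardize_kiosk_date(raw: str) -> str:
    """Convert 'MM/DD/YYYY' (kiosk source) to ISO 8601.

    The kiosk source has no time component, so midnight is assumed.
    """
    return datetime.strptime(raw, "%m/%d/%Y").isoformat()

# Examples:
# standardize_online_timestamp("2024-05-01 14:30:00") -> "2024-05-01T14:30:00"
# standardize_kiosk_date("05/01/2024")                -> "2024-05-01T00:00:00"
```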
Based on these requirements, let's outline the specific actions within each stage:

Extract:

- E1: Read the online_sales.csv file. Filter records where order_timestamp corresponds to the previous day. Select order_id, product_sku, quantity, sale_amount, and order_timestamp.
- E2: Query the kiosk_sales table. Filter records where sale_date corresponds to the previous day. Select transaction_id, item_code, units_sold, total_price, and sale_date.

Transform:

- T1 (online data):
  - Rename columns: product_sku -> product_identifier, quantity -> quantity_sold, sale_amount -> revenue.
  - Convert order_timestamp to ISO 8601 format and rename it to sale_datetime.
  - Add a column source with the value 'Online'.
  - Use order_id as a basis for the final sale_id (or generate a new unique ID).
- T2 (kiosk data):
  - Rename columns: item_code -> product_identifier, units_sold -> quantity_sold, total_price -> revenue.
  - Convert sale_date (MM/DD/YYYY) to ISO 8601 format (YYYY-MM-DDTHH:MM:SS, perhaps assuming midnight if no time is available) and rename it to sale_datetime.
  - Add a column source with the value 'Kiosk'.
  - Use transaction_id as a basis for the final sale_id (or generate a new unique ID, ensuring uniqueness across both sources).
- T3 (combine): Append the two transformed datasets and ensure sale_id is unique across all combined records. If using source IDs (order_id, transaction_id), prefixing them (e.g., 'ONL-' + order_id, 'KSK-' + transaction_id) might be a simple way to ensure uniqueness.

Load:

- L1: Write the combined dataset to a new file (daily_consolidated_sales.csv) or truncate and load into the daily_consolidated_sales database table.
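The breakdown above is all the exercise asks for, but to make it concrete, here is one hypothetical way the tasks could map onto Python with pandas. The file paths, the SQLite connection for the kiosk database, and the function names are assumptions for illustration; the ONL-/KSK- prefixes follow the uniqueness idea in T3.

```python
import sqlite3
from datetime import date, timedelta

import pandas as pd

# The report always covers the previous day.
YESTERDAY = date.today() - timedelta(days=1)

def extract_online_sales(path: str = "online_sales.csv") -> pd.DataFrame:
    """E1: read the online CSV and keep only yesterday's orders."""
    df = pd.read_csv(path, parse_dates=["order_timestamp"])
    df = df[df["order_timestamp"].dt.date == YESTERDAY]
    return df[["order_id", "product_sku", "quantity", "sale_amount", "order_timestamp"]]

def extract_kiosk_sales(db_path: str = "kiosk.db") -> pd.DataFrame:
    """E2: query the kiosk_sales table and keep only yesterday's transactions."""
    conn = sqlite3.connect(db_path)
    # For a sketch we filter in pandas; a real query could filter in the WHERE clause.
    df = pd.read_sql_query(
        "SELECT transaction_id, item_code, units_sold, total_price, sale_date "
        "FROM kiosk_sales",
        conn,
    )
    conn.close()
    df["sale_date"] = pd.to_datetime(df["sale_date"], format="%m/%d/%Y")
    return df[df["sale_date"].dt.date == YESTERDAY]

TARGET_COLUMNS = ["sale_id", "source", "product_identifier",
                  "quantity_sold", "revenue", "sale_datetime"]

def transform_online(df: pd.DataFrame) -> pd.DataFrame:
    """T1: rename columns, standardize the timestamp, tag the source."""
    out = df.rename(columns={"product_sku": "product_identifier",
                             "quantity": "quantity_sold",
                             "sale_amount": "revenue"})
    out["sale_datetime"] = out["order_timestamp"].dt.strftime("%Y-%m-%dT%H:%M:%S")
    out["source"] = "Online"
    out["sale_id"] = "ONL-" + out["order_id"].astype(str)
    return out[TARGET_COLUMNS]

def transform_kiosk(df: pd.DataFrame) -> pd.DataFrame:
    """T2: rename columns, standardize the date (midnight assumed), tag the source."""
    out = df.rename(columns={"item_code": "product_identifier",
                             "units_sold": "quantity_sold",
                             "total_price": "revenue"})
    out["sale_datetime"] = out["sale_date"].dt.strftime("%Y-%m-%dT%H:%M:%S")
    out["source"] = "Kiosk"
    out["sale_id"] = "KSK-" + out["transaction_id"].astype(str)
    return out[TARGET_COLUMNS]

def combine(online: pd.DataFrame, kiosk: pd.DataFrame) -> pd.DataFrame:
    """T3: append both sources; the ONL-/KSK- prefixes keep sale_id unique."""
    return pd.concat([online, kiosk], ignore_index=True)

def load(df: pd.DataFrame, path: str = "daily_consolidated_sales.csv") -> None:
    """L1: overwrite the consolidated report for the previous day."""
    df.to_csv(path, index=False)

if __name__ == "__main__":
    load(combine(transform_online(extract_online_sales()),
                 transform_kiosk(extract_kiosk_sales())))
```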
Now, let's visualize the flow and how the tasks depend on each other. The extraction tasks (E1 and E2) can often run in parallel since they draw from different sources. However, the transformation tasks (T1, T2) depend on their respective extraction tasks completing. The final combination step (T3) depends on both T1 and T2 being finished. Finally, the load task (L1) can only start after the combined data is ready (T3 complete).

A diagram illustrating the workflow for the Daily Sales Report Consolidation pipeline. Extraction tasks E1 and E2 run first, potentially in parallel. Their outputs feed into transformation tasks T1 and T2 respectively. Task T3 combines the results of T1 and T2. Finally, task L1 loads the result of T3 into the target, marking the end of the process.
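Those dependencies also tell you what can overlap at run time. Reusing the hypothetical functions from the sketch above, the two extracts could run concurrently while everything downstream stays sequential; a workflow orchestrator would express the same ordering declaratively rather than in code.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumes extract_online_sales, transform_online, etc. from the previous sketch.

# E1 and E2 touch different sources, so they can run in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    online_future = pool.submit(extract_online_sales)   # E1
    kiosk_future = pool.submit(extract_kiosk_sales)     # E2
    online_raw = online_future.result()
    kiosk_raw = kiosk_future.result()

# T1 and T2 each depend only on their own extract finishing.
online_clean = transform_online(online_raw)
kiosk_clean = transform_kiosk(kiosk_raw)

# T3 needs both transforms; L1 needs T3.
load(combine(online_clean, kiosk_clean))
```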
While designing, briefly think about:
- Scheduling: The pipeline should run automatically once per day, after the previous day's data is complete, triggered by a scheduler (cron on Linux/macOS, Task Scheduler on Windows, or a workflow orchestrator tool).
- Monitoring and error handling: What happens if the online_sales.csv file is missing? Or the database is down? Basic logging should record the start, end, and any errors encountered during each task (E1, E2, T1, T2, T3, L1). Alerting might be needed for critical failures (see the logging sketch below).

You have now designed a simple, yet complete, ETL pipeline on paper (or screen). You identified the requirements, broke the process down into Extract, Transform, and Load stages, defined the specific tasks within each stage, and mapped out the workflow dependencies. This structured thinking process is fundamental to building reliable data pipelines, regardless of the complexity or the specific tools you eventually use for implementation. Sketching out pipelines like this helps clarify the logic and identify potential issues before you start building.
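To make the monitoring point concrete, one lightweight option is to wrap each task in a small helper that logs its start, finish, and any failure using Python's standard logging module. The log file name, the helper, and the example cron entry below are illustrative assumptions, not part of the scenario.

```python
import logging

logging.basicConfig(
    filename="daily_sales_pipeline.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run_task(name, func, *args):
    """Run one pipeline task (E1, E2, T1, T2, T3, or L1) with basic logging."""
    logging.info("%s started", name)
    try:
        result = func(*args)
    except Exception:
        # A real pipeline might also send an alert (email, chat message) here.
        logging.exception("%s failed", name)
        raise
    logging.info("%s finished", name)
    return result

# Example cron entry (Linux/macOS) to run the whole pipeline at 02:00 each day:
#   0 2 * * * /usr/bin/python3 /path/to/daily_sales_pipeline.py
```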