Now that you understand the components of a data pipeline, including extraction, transformation, loading, and the differences between ETL and ELT, let's put this knowledge into practice. This exercise involves sketching out a basic data pipeline for a common scenario. The goal isn't to write code but to visualize the flow of data and the steps involved.
Scenario: Daily Website User Activity Report
Imagine you work for an e-commerce company. Your team needs a daily report summarizing the number of unique pages each user visited on the website the previous day. This report will help the marketing team understand user engagement.
Here's what you know:
- Data Source: Website activity is logged as JSON files. A new file is generated each day containing all user interactions (clicks, page views) from that day. Each record in the file includes a timestamp, user ID, page URL, and other details. These files are stored in a designated cloud storage location. (A sample record is shown after this list.)
- Requirement: Produce a summary table in the company's relational database. This table should contain three columns: report_date, user_id, and distinct_pages_visited.
- Frequency: The report needs to be updated once daily, summarizing the previous day's activity.
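For concreteness, here is what a single parsed record from one of these daily log files might look like. The field names below are assumptions for illustration; the real logs will have their own schema.

```python
# A hypothetical log record after parsing one entry of the daily JSON file.
# Field names (timestamp, user_id, page_url, event_type) are illustrative only.
sample_record = {
    "timestamp": "2024-05-01T09:21:05Z",
    "user_id": "u_10423",
    "page_url": "/products/blue-widget",
    "event_type": "page_view",
}
```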
Your Task
Sketch a data pipeline to fulfill this requirement. You can draw a simple diagram (like boxes and arrows) on paper or use a text editor to list the steps.
Consider the following points in your sketch:
- Identify the Source: Where does the data originate?
- Identify the Target: Where should the final, processed data reside?
- Extraction Step: How will you get the data from the source?
- Transformation Steps: What needs to happen to the raw data to meet the report requirements? Think about:
- Reading the data format (JSON).
- Selecting necessary fields (user ID, page URL).
- Calculating the distinct page count per user for the day.
- Adding the report date.
- Loading Step: How will the transformed data get into the target database?
- ETL or ELT? Does your process resemble Extract, Transform, Load or Extract, Load, Transform more closely?
- Orchestration: How will this pipeline run automatically each day? (Think simple scheduling).
Take a few minutes to outline your pipeline based on the scenario above.
Example Sketch and Explanation
Here’s one possible way to sketch this pipeline. We can represent it visually using a simple diagram.
(Diagram: data flow from daily JSON log files in cloud storage to a relational database table, triggered by a daily scheduler.)
Let's break down the steps corresponding to this diagram:
- Source: Daily JSON log files stored in cloud storage.
- Target: A table named user_daily_activity (for example) in a relational database.
- Extraction: A process reads the previous day's JSON log file from cloud storage (see the code sketch after this list).
- Transformation:
- The process parses each JSON record.
- It extracts the user_id and page_url fields.
- It groups the records by user_id.
- For each user_id, it calculates the count of distinct page_url values.
- It adds the date for which the report is generated (e.g., yesterday's date).
- Loading: The process inserts the resulting summarized data (report_date, user_id, distinct_pages_visited) into the user_daily_activity table in the database.
- ETL or ELT? This approach follows the ETL pattern: the data is extracted, transformed (parsed, aggregated) in memory or a temporary processing environment, and only the final result is loaded into the database. An ELT approach might instead load the raw JSON data into a staging area within the database first, and then use SQL queries to perform the transformation and aggregation directly within the database (a sketch of this variant also appears after the list).
- Orchestration: A simple daily scheduler (like a cron job or a basic cloud scheduler service) is set up to trigger this entire pipeline process once per day, typically early in the morning after the previous day's log file is complete and available.
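To make the steps above concrete, here is a minimal sketch of the ETL version in Python. It assumes the daily log has already been pulled down as a newline-delimited JSON file, that the target table user_daily_activity already exists, and that SQLite stands in for the company's relational database; file paths, file naming, field names, and connection details are illustrative assumptions, not a prescribed implementation.

```python
import json
import sqlite3
from collections import defaultdict
from datetime import date, timedelta
from pathlib import Path


def run_daily_report(log_dir: str = "logs", db_path: str = "reports.db") -> None:
    """Sketch of the daily ETL job: extract yesterday's log, aggregate, load."""
    report_date = date.today() - timedelta(days=1)

    # --- Extract: read the previous day's log file (assumed here to be
    # newline-delimited JSON, named like activity_2024-05-01.json).
    log_file = Path(log_dir) / f"activity_{report_date.isoformat()}.json"
    records = [
        json.loads(line)
        for line in log_file.read_text().splitlines()
        if line.strip()
    ]

    # --- Transform: keep user_id and page_url, count distinct pages per user.
    pages_by_user = defaultdict(set)
    for record in records:
        pages_by_user[record["user_id"]].add(record["page_url"])
    rows = [
        (report_date.isoformat(), user_id, len(pages))
        for user_id, pages in pages_by_user.items()
    ]

    # --- Load: insert the summary rows into the target table
    # (assumes user_daily_activity already exists).
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO user_daily_activity "
            "(report_date, user_id, distinct_pages_visited) VALUES (?, ?, ?)",
            rows,
        )
    conn.close()


if __name__ == "__main__":
    run_daily_report()
```

For orchestration, a crontab entry along the lines of `0 3 * * * python run_daily_report.py` (every day at 03:00) would be enough to trigger this script once the previous day's log file is expected to be available; a managed cloud scheduler plays the same role.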
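By contrast, an ELT version would push the raw records into a staging table first and let the database do the aggregation. The sketch below shows only that in-database transformation step, reusing the SQLite stand-in and a hypothetical staging table and columns (raw_activity, event_date); in practice this would run in a data warehouse, often driven by a SQL transformation tool.

```python
import sqlite3

# Hypothetical in-database transformation for the ELT variant: the raw JSON
# records are assumed to have been loaded into a staging table raw_activity.
ELT_TRANSFORM_SQL = """
INSERT INTO user_daily_activity (report_date, user_id, distinct_pages_visited)
SELECT event_date AS report_date,
       user_id,
       COUNT(DISTINCT page_url) AS distinct_pages_visited
FROM raw_activity
WHERE event_date = :report_date
GROUP BY event_date, user_id;
"""


def transform_in_database(db_path: str, report_date: str) -> None:
    """Run the aggregation inside the database, as an ELT pipeline would."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(ELT_TRANSFORM_SQL, {"report_date": report_date})
    conn.close()
```

Here the extract and load steps simply copy the raw data into raw_activity; the aggregation happens in SQL inside the database, which is the defining trait of ELT.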
Reflecting on Your Sketch
Compare your sketch to the example. Did you identify similar steps?
- Did you clearly separate extraction, transformation, and loading?
- What specific transformations did you identify? (e.g., counting distinct items, grouping)
- How did you envision the process running daily?
This exercise demonstrates how the building blocks discussed in this chapter fit together to solve a practical data movement and processing problem. Even simple requirements involve careful thought about data flow, processing steps, and scheduling. As requirements become more complex, these pipeline designs naturally grow, incorporating more sophisticated tools and techniques, which you will learn about as you continue your data engineering education.