ETL processes exist to move data. But where does this data come from, and where does it end up? Understanding the typical starting points (sources) and destinations (targets) is fundamental to grasping how ETL works. Think of it like planning a trip: you need to know your origin and your destination before you can map the route.
Data sources are the systems, files, or applications where the raw data originates. In today's world, data can come from a huge variety of places. Here are some of the most common ones you'll encounter:
Relational Databases: These are often the backbone of business applications. They store data in a structured format using tables organized into rows and columns. Think of spreadsheets, but much more powerful and reliable. Examples include PostgreSQL, MySQL, Microsoft SQL Server, and Oracle Database.
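To make this concrete, here is a minimal extraction sketch in Python. It uses SQLite (bundled with the standard library) as a stand-in for a production database, and the `customers` table and its columns are invented purely for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for a production relational
# source such as PostgreSQL or MySQL. The table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "UK"), (2, "Grace", "US")],
)

# The "extract" step: run a query and pull the rows into Python.
rows = conn.execute("SELECT id, name, country FROM customers").fetchall()
for row in rows:
    print(row)

conn.close()
```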
Files: Data is frequently stored in files, especially for exchanging information between systems or for logging. Common file types include CSV (comma-separated values), JSON, XML, and Parquet.
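As a rough sketch, the snippet below reads the same invented order records from both CSV and JSON using only the Python standard library; the `order_id` and `amount` fields are placeholders for whatever a real file would contain.

```python
import csv
import io
import json

# A small CSV payload held in memory; in practice this would be a file
# exported by another system or dropped into a shared location.
csv_text = "order_id,amount\n1001,25.50\n1002,13.75\n"

# Extract rows from the CSV source as dictionaries.
orders = list(csv.DictReader(io.StringIO(csv_text)))

# The same records could just as easily arrive as JSON.
json_text = json.dumps(orders)
print(json.loads(json_text))
```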
APIs (Application Programming Interfaces): Many web services and applications expose APIs that allow other programs to request data. When an ETL process needs data from a third-party service (like a weather service, a social media platform, or a payment gateway), it often interacts with that service's API. Data from APIs is frequently delivered in JSON format.
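The sketch below shows what a simple API extraction might look like using the popular `requests` library. The URL, parameters, and response structure are hypothetical; a real service documents its own endpoint, authentication scheme, and query parameters.

```python
import requests

# Hypothetical endpoint and parameters, purely for illustration.
url = "https://api.example.com/v1/weather"
params = {"city": "London", "units": "metric"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # fail loudly if the request was rejected

# Most APIs return JSON, which parses directly into Python dicts and lists.
payload = response.json()
print(payload)
```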
Streaming Sources: Increasingly, data is generated continuously. Think of sensor readings from IoT devices, website clickstreams, or financial market data. These "streams" require different extraction approaches compared to static databases or files, often involving technologies like Apache Kafka or cloud-specific streaming services. While we focus on batch ETL in this introductory course, it's good to know these sources exist.
Data targets are the systems or locations where the transformed data is loaded. The choice of target depends heavily on what you intend to do with the data.
Data Warehouses: This is perhaps the most traditional and common target for ETL processes focused on analytics. A data warehouse is a central repository designed specifically for reporting and data analysis. Data from various sources is cleaned, transformed, and structured consistently before being loaded. This makes it easier for business analysts and data scientists to query and gain insights. Examples include Amazon Redshift, Google BigQuery, Snowflake, and Azure Synapse Analytics.
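As an illustration of the load step, the sketch below inserts transformed records into a warehouse-style fact table. SQLite stands in for the warehouse and the `daily_sales` table is invented; real warehouses usually provide bulk-load commands (such as `COPY`) that are preferred over row-by-row inserts for large volumes.

```python
import sqlite3

# Transformed, analytics-ready records produced earlier in the pipeline.
sales_facts = [
    ("2024-01-05", "EU", 1250.00),
    ("2024-01-05", "US", 2310.50),
]

# SQLite stands in for a warehouse here; a real warehouse connection would
# come from its own driver or connector library.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_sales (sale_date TEXT, region TEXT, revenue REAL)"
)
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", sales_facts)
warehouse.commit()

print(warehouse.execute("SELECT * FROM daily_sales").fetchall())
```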
Data Lakes: A data lake is a repository that stores vast amounts of raw data in its native format. Unlike data warehouses, which require data to be structured before loading (schema-on-write), data lakes often allow data to be loaded first and structured later when it's needed for analysis (schema-on-read). They offer flexibility but can become difficult to manage (sometimes jokingly called "data swamps") if not governed properly. Cloud storage services like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage often form the basis of data lakes.
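A minimal sketch of loading raw records into an object-storage-backed data lake might look like the following, here using the `boto3` library for Amazon S3. The bucket name and key layout are placeholders, and running it requires `boto3` installed and AWS credentials configured.

```python
import json
import boto3  # AWS SDK for Python

# Raw records are typically written to object storage in their native
# format; the bucket name and key below are placeholders.
raw_events = [
    {"user_id": 42, "event": "page_view", "ts": "2024-01-05T10:00:00Z"},
]

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-raw-data-lake",          # hypothetical bucket
    Key="events/2024/01/05/events.json",     # partition-style key layout
    Body=json.dumps(raw_events).encode("utf-8"),
)
```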
Operational Databases: Sometimes, the goal of an ETL process isn't analytics, but rather to update or enrich data in an operational system (like a CRM or an inventory database). For instance, an ETL process might pull customer data from multiple sources, clean it, and load the consolidated view back into the main customer database.
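A common pattern here is an "upsert": insert a record if it is new, otherwise update the existing row. The sketch below shows the idea with SQLite standing in for the operational system; the `customers` table is invented, and the exact upsert syntax varies between databases.

```python
import sqlite3

# SQLite stands in for an operational system such as a CRM database.
crm = sqlite3.connect(":memory:")
crm.execute(
    "CREATE TABLE customers (email TEXT PRIMARY KEY, full_name TEXT, segment TEXT)"
)
crm.execute("INSERT INTO customers VALUES ('ada@example.com', 'Ada L.', 'new')")

# A consolidated record produced by the ETL process from several sources.
enriched = ("ada@example.com", "Ada Lovelace", "high_value")

# Insert the record, or update it if the customer already exists.
# This ON CONFLICT form works in SQLite 3.24+; other databases differ.
crm.execute(
    """
    INSERT INTO customers (email, full_name, segment) VALUES (?, ?, ?)
    ON CONFLICT(email) DO UPDATE SET
        full_name = excluded.full_name,
        segment = excluded.segment
    """,
    enriched,
)
crm.commit()
print(crm.execute("SELECT * FROM customers").fetchall())
```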
Files: Just as files can be sources, they can also be targets. An ETL process might simply clean and transform data from one set of files and output the results as new CSV or JSON files, perhaps to be consumed by another system or for simple reporting.
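For example, a pipeline's final step might be as simple as writing the cleaned records to a new CSV file, as in this sketch (the field names are illustrative):

```python
import csv

# Transformed records ready to hand off to another system or report.
cleaned = [
    {"order_id": 1001, "amount": 25.50, "currency": "USD"},
    {"order_id": 1002, "amount": 13.75, "currency": "USD"},
]

# Write the results out as a new CSV file.
with open("cleaned_orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "amount", "currency"])
    writer.writeheader()
    writer.writerows(cleaned)
```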
The following diagram illustrates the flow from common sources through an ETL process to typical targets:
Data flows from various sources like databases, files, and APIs, undergoes the ETL process, and is loaded into targets such as data warehouses, data lakes, or other systems.
Understanding these common sources and targets provides context for the "Extract" and "Load" phases of ETL, which we will examine in more detail in later chapters. The specific combination of sources and targets heavily influences the design and requirements of any ETL pipeline.