The journey of data through a pipeline begins with extraction. Think of it as the first step in getting raw materials from their origin to a factory. Before any cleaning, shaping, or combining can happen, you need to get the data out of wherever it currently lives. Data resides in many different places, and the method used to retrieve it depends heavily on the source system.
This section covers common techniques data engineers use to extract data, forming the crucial first stage ('E' in ETL and ELT) of your data pipeline.
Extracting from Databases
Databases are structured repositories of information, and getting data out often involves speaking their language.
- Relational Databases (SQL): These databases, like PostgreSQL, MySQL, or SQL Server, use Structured Query Language (SQL). Extraction typically involves writing SQL `SELECT` statements to specify which tables, columns, and rows you need.
- Full Table Extraction: Sometimes, you might need the entire contents of a table. This is straightforward but can be resource-intensive for large tables.
```sql
SELECT * FROM customers;
```
- Incremental Extraction: Often, you only need data that has changed or been added since the last extraction. This is more efficient. Common methods include filtering on a last-modified timestamp column, tracking the highest ID (or other high-water mark) seen so far, or using the database's change data capture (CDC) features. A sketch of the timestamp approach appears after this list.
- NoSQL Databases: These databases (like MongoDB, Cassandra, Couchbase) store data in formats other than traditional rows and columns (e.g., documents, key-value pairs). They have their own query languages or APIs. Extraction might involve:
- Using the database's specific query language (e.g., MQL for MongoDB).
- Utilizing provided client libraries in programming languages like Python or Java to fetch data programmatically.
- Similar to SQL databases, you can perform full extractions or implement incremental logic based on timestamps or other markers if the data model supports it.
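To make the incremental approach concrete, here is a minimal Python sketch that uses a timestamp watermark. The table name, column names, database file, and watermark file are assumptions made for illustration; sqlite3 from the standard library stands in for whichever driver your actual source system requires (for example, psycopg2 for PostgreSQL).

```python
# A minimal sketch of timestamp-based incremental extraction.
# Assumptions (not from the text above): a "customers" table with an
# "updated_at" column, plus a local file that persists the watermark between runs.
import sqlite3
from pathlib import Path

WATERMARK_FILE = Path("last_extracted_at.txt")  # hypothetical bookkeeping file

def read_watermark() -> str:
    """Return the timestamp of the last successful extraction (epoch start if none)."""
    if WATERMARK_FILE.exists():
        return WATERMARK_FILE.read_text().strip()
    return "1970-01-01 00:00:00"

def extract_incremental(conn: sqlite3.Connection) -> list[tuple]:
    """Fetch only rows added or changed since the previous run."""
    watermark = read_watermark()
    cursor = conn.execute(
        "SELECT id, name, email, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    )
    rows = cursor.fetchall()
    if rows:
        # Persist the newest timestamp we saw so the next run starts from there.
        WATERMARK_FILE.write_text(rows[-1][-1])
    return rows

if __name__ == "__main__":
    connection = sqlite3.connect("example.db")  # hypothetical database file
    new_rows = extract_incremental(connection)
    print(f"Extracted {len(new_rows)} new or updated rows")
```

The same pattern works for a document store: instead of a SQL `WHERE` clause, you would filter on a timestamp field with the database's own query syntax and keep the watermark logic unchanged.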
Fetching Data from APIs
Many web services and applications expose their data through Application Programming Interfaces (APIs). Think of an API as a waiter in a restaurant; you make a request (ask for data), and the API delivers the response (the data itself).
- How it Works: Typically, you send an HTTP request (often a `GET` request) to a specific URL endpoint provided by the API. The API processes the request and sends back data, commonly in formats like JSON or XML.
- Authentication: Most APIs require some form of authentication to identify and authorize who is making the request. This often involves sending an API key or token along with your request.
- Parameters: You can often customize the data you receive by including parameters in your request URL, such as specifying date ranges, filtering criteria, or the number of results per page (pagination).
- Example: Requesting user data might look like sending a GET request to https://api.example.com/v1/users?status=active&page=1 (see the Python sketch after this list).
- Rate Limiting: APIs often have limits on how many requests you can make within a certain time period to prevent abuse. Your extraction process needs to respect these limits, perhaps by adding pauses between requests.
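Putting those pieces together, the sketch below pages through an endpoint with the requests library. The URL comes from the example above, but the `X-API-Key` header and the response fields (`data`, `has_more`) are illustrative assumptions; a real API documents its own authentication scheme and pagination format.

```python
# A hedged sketch of paginated API extraction with the requests library.
# The "X-API-Key" header and the response structure ("data", "has_more")
# are assumptions for illustration; consult the actual API's documentation.
import time
import requests

BASE_URL = "https://api.example.com/v1/users"   # from the example above
API_KEY = "your-api-key-here"                   # placeholder credential

def fetch_all_active_users() -> list[dict]:
    """Page through the endpoint, pausing between requests to respect rate limits."""
    users: list[dict] = []
    page = 1
    while True:
        response = requests.get(
            BASE_URL,
            headers={"X-API-Key": API_KEY},
            params={"status": "active", "page": page},
            timeout=30,
        )
        response.raise_for_status()          # fail loudly on HTTP errors
        payload = response.json()
        users.extend(payload["data"])        # assumed field holding the records
        if not payload.get("has_more"):      # assumed pagination flag
            break
        page += 1
        time.sleep(1)                        # crude pause to respect rate limits
    return users

if __name__ == "__main__":
    print(f"Fetched {len(fetch_all_active_users())} users")
```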
Reading Data from Files
Data is frequently stored in files, residing on local disks, network shares, or cloud storage systems.
- File Types: Common formats include CSV (Comma Separated Values), JSON (JavaScript Object Notation), Parquet, Avro, XML, and plain text log files.
- Location:
- Local/Network File Systems: Your pipeline might need to read files directly from the server it runs on or from a mounted network drive.
- Cloud Object Storage: Services like Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage are very common for storing large volumes of data files. Extraction involves using cloud provider tools or libraries to access and download these files.
- Extraction Logic:
- Full File Read: Read the entire content of one or more files.
- Incremental Reading: Process only new files that have appeared in a directory since the last run, often identified by filename patterns or timestamps (a sketch of this pattern follows the list).
- Reading Specific Parts: For some formats (like Parquet), it's possible to efficiently read only specific columns or sections of the file without loading the whole thing into memory.
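As a concrete illustration of incremental reading, the sketch below processes only CSV files that are newer than the last recorded run. The `incoming/` directory and the watermark file are assumptions; the same list-filter-read pattern applies to cloud object storage, for example by comparing each object's last-modified time when listing a bucket prefix.

```python
# A minimal sketch of incremental file extraction from a local directory.
# The "incoming/" directory and the watermark file are illustrative assumptions.
import csv
import json
from pathlib import Path

INCOMING_DIR = Path("incoming")                 # hypothetical landing directory
STATE_FILE = Path("file_watermark.json")        # remembers the last run

def last_processed_mtime() -> float:
    """Return the modification time up to which files were already processed."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["mtime"]
    return 0.0

def extract_new_csv_rows() -> list[dict]:
    """Read rows only from CSV files modified since the previous run."""
    cutoff = last_processed_mtime()
    rows: list[dict] = []
    newest = cutoff
    for path in sorted(INCOMING_DIR.glob("*.csv")):
        mtime = path.stat().st_mtime
        if mtime <= cutoff:
            continue                            # already handled in an earlier run
        with path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
        newest = max(newest, mtime)
    STATE_FILE.write_text(json.dumps({"mtime": newest}))
    return rows

if __name__ == "__main__":
    print(f"Extracted {len(extract_new_csv_rows())} rows from new files")
```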
Data flows from various sources, through specific extraction techniques, into the next stage of the pipeline.
Subscribing to Streaming Data
Unlike the batch-oriented methods above, some data arrives as a continuous stream of events. Think of social media feeds, sensor readings, or application logs generated in real-time.
- Sources: Platforms like Apache Kafka, Google Cloud Pub/Sub, or Amazon Kinesis are designed to handle these streams.
- How it Works: Instead of periodically requesting data, your pipeline subscribes to a data stream or topic. As new data events arrive at the source, they are pushed to your pipeline almost immediately.
- Considerations: Extracting from streams requires different tools and architectural patterns compared to batch extraction, focusing on processing individual events or small micro-batches as they arrive. This course focuses primarily on batch pipelines (ETL/ELT), but it's useful to know that stream processing exists for real-time needs; a brief sketch of a stream consumer follows this list.
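For completeness, here is a hedged glimpse of what subscribing to a stream can look like, using the kafka-python client. The topic name, broker address, and message format are assumptions; the point is simply that the consumer loop runs continuously and receives events as they arrive rather than on a schedule.

```python
# A brief, hedged sketch of stream extraction with the kafka-python client.
# The topic name ("user-events") and broker address are illustrative assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",                                   # hypothetical topic
    bootstrap_servers="localhost:9092",              # hypothetical broker
    group_id="extraction-demo",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Unlike a batch job, this loop runs continuously: each event is pushed to the
# consumer as it arrives rather than being requested periodically.
for message in consumer:
    print(f"offset={message.offset} value={message.value}")
```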
Choosing the Right Technique
The best extraction method depends on:
- The Data Source: SQL queries for relational databases, API calls for web services, file readers for files, stream consumers for event streams, and so on.
- Data Volume: Extracting gigabytes from a database requires a different approach than fetching a few kilobytes from an API.
- Frequency: How often do you need the data? Every few seconds (streaming), every hour (mini-batch), or once a day (batch)?
- Data Format: The structure (or lack thereof) influences how data is parsed after extraction.
- Available Tools: The specific software and libraries you have access to will shape implementation.
Extraction is the gateway for data into your pipeline. By understanding these fundamental techniques for retrieving data from databases, APIs, files, and streams, you're equipped to build the first essential component of robust data systems. The next steps typically involve transforming this raw extracted data or loading it into a target system.