Structured data is perhaps the most common type you'll encounter when building ETL pipelines. Think of it as information organized neatly into rows and columns, much like a spreadsheet or a table in a database. This predictable format, often defined by a clear schema (a blueprint describing the data structure), makes it relatively straightforward to extract.
We'll look at two primary sources of structured data: relational databases and comma-separated value (CSV) files.
Relational databases (like PostgreSQL, MySQL, SQL Server) store data in tables with predefined columns and data types. The standard language for interacting with these databases is SQL (Structured Query Language). Extraction typically involves writing SQL queries to retrieve the necessary data.
The most fundamental SQL command for extraction is SELECT. In its simplest form, you can retrieve all data from a table:
SELECT * FROM customers;
The asterisk (*) is a wildcard meaning "all columns". However, retrieving all columns is often inefficient, especially for large tables. It's better practice to specify only the columns you need for your ETL process:
SELECT customer_id, first_name, last_name, email, registration_date
FROM customers;
Often, you don't need all the rows either. You might only need customers who registered after a certain date, or customers located in a specific region. The WHERE clause in SQL allows you to filter rows based on specific conditions:
SELECT customer_id, first_name, email
FROM customers
WHERE country = 'USA' AND registration_date >= '2023-01-01';
This query selects the ID, first name, and email for customers from the USA who registered on or after January 1st, 2023.
Executing these queries requires a connection to the database (as discussed in "Connecting to Data Sources") using appropriate credentials and connection details. The result of the query is typically a tabular dataset that your ETL tool or script can then process.
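As a minimal sketch, here is how that query might be executed from Python. This example uses the standard library's sqlite3 module so it runs without external dependencies; the database file name warehouse.db is a placeholder, and a production pipeline against PostgreSQL or MySQL would use the corresponding driver with the same overall pattern:

import sqlite3

# Connect using the appropriate connection details.
# "warehouse.db" is a placeholder database file.
connection = sqlite3.connect("warehouse.db")
cursor = connection.cursor()

# Parameterized query: select only the columns and rows the pipeline needs.
cursor.execute(
    """
    SELECT customer_id, first_name, email
    FROM customers
    WHERE country = ? AND registration_date >= ?
    """,
    ("USA", "2023-01-01"),
)

rows = cursor.fetchall()  # a list of tuples, one per matching row
connection.close()

The result set (rows) is the tabular dataset that subsequent ETL steps would transform and load.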
CSV files are simple text files that represent tabular data. Each line in the file typically corresponds to a row, and values within a row are separated by a delimiter, most commonly a comma. Often, the first line contains header names for the columns.
Here’s a small example of a products.csv file:
ProductID,ProductName,Category,Price
101,Laptop,Electronics,1200.00
102,Coffee Maker,Home Goods,85.50
103,Notebook,Stationery,2.99
104,Desk Chair,Furniture,150.00
Extracting data from CSV files involves reading the file line by line and parsing each line based on the delimiter. While you could write code to do this manually, most programming languages and ETL tools have built-in libraries or components specifically designed for reading CSV files efficiently.
These tools handle common complexities:

- Alternative delimiters: while commas are the most common separator, files may instead use tabs (\t), semicolons (;), or pipes (|) as separators. The extraction process needs to know which delimiter to use.
- Quoted values: fields that contain the delimiter itself are usually enclosed in quotes ("). The parser needs to handle this correctly.

When setting up extraction from a CSV file, you typically need to configure these parameters (file path, delimiter, header presence, encoding) so the reader can interpret the file correctly, as the sketch below shows.
Diagram illustrating the extraction flow from a relational database using SQL and from a CSV file using a parser.
Understanding how to query databases and parse structured files like CSVs forms the bedrock of many data extraction tasks. While the tools and specific commands might vary, the underlying principles of identifying, selecting, and retrieving data in a tabular format remain consistent.