Okay, now that we have a basic understanding of different data formats like structured, semi-structured, and unstructured data, let's figure out where this data actually comes from and how data engineers typically gather it. Identifying the origin and the method of collection is often the very first step when designing systems to store and process data. Think of it like knowing where your ingredients come from before you start cooking.
Data can originate from a surprisingly diverse set of places. As a data engineer, you'll encounter many of these regularly. Let's look at some common categories:
Operational Databases: These are the workhorses behind many applications. They store the current state of business operations. Examples include relational databases such as PostgreSQL or MySQL tracking orders, inventory, and customer accounts, and NoSQL databases such as MongoDB backing user profiles or product catalogs.
Logs: Almost every computer system generates logs. These are records of events that occurred, often used for troubleshooting, monitoring performance, or understanding usage patterns. Examples include web server access logs, application error logs, and infrastructure or operating system logs.
User Activity and Interaction Data: Modern applications, especially web and mobile apps, generate vast amounts of data based on user behavior. This includes clickstream events, page views, searches, purchases, and other in-app interactions.
APIs (Application Programming Interfaces): APIs are like contracts that allow different software systems to request and exchange data in a predefined way. Data engineers use APIs to fetch data from third-party services (payment providers, social media platforms, weather services) and from other internal systems within their own organization.
Files: Sometimes, data simply arrives as files. This is common for exports shared by partner organizations, CSV or spreadsheet reports, periodic data dumps dropped into shared storage, and collections of documents, images, or audio.
Streaming Sources: Unlike data sitting in databases or files, some data arrives as a continuous flow, or stream. Think of sensor readings from IoT devices, financial market ticks, or real-time event feeds from applications.
Knowing the source is half the battle; the other half is knowing how to collect the data. The method often depends on the source:
Database Queries: For operational databases, the most direct method is often querying them using their native language, typically SQL for relational databases or specific query languages for NoSQL databases. Data might be extracted in bulk periodically (e.g., every night) or more frequently.
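To make the bulk-extract idea concrete, here is a minimal sketch of a nightly-style pull from an operational database. It uses Python's built-in sqlite3 driver so it runs anywhere; for PostgreSQL or MySQL you would swap in the appropriate driver, and the `orders` table and its columns are hypothetical.

```python
import csv
import sqlite3

def extract_orders(db_path: str, output_path: str) -> int:
    """Pull all rows from a hypothetical 'orders' table and write them to CSV."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT id, customer_id, total, created_at FROM orders"
        )
        rows = cursor.fetchall()  # a real bulk extract would stream in batches
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cursor.description])  # header row
            writer.writerows(rows)
        return len(rows)
    finally:
        conn.close()

# extract_orders("app.db", "orders_extract.csv")  # run on a schedule, e.g. nightly
```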
Log Shipping / Aggregation: Collecting logs from potentially hundreds or thousands of servers requires specialized tools. Log shippers (like Fluentd, Filebeat, or cloud-native agents) run on servers, read log files as they are written, and forward the log events to a central storage or processing system.
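As a toy illustration of what a log shipper does at its core, the sketch below tails a log file and forwards each new line. Real agents like Fluentd or Filebeat add buffering, batching, retries, and log-rotation handling; the file path and the forwarding step here are placeholders.

```python
import time
from pathlib import Path

def forward(event: str) -> None:
    # Stand-in for sending the event to a central log backend.
    print(f"shipping log event: {event}")

def tail_and_ship(log_path: str, poll_seconds: float = 1.0) -> None:
    """Read new lines from a log file as they are appended and forward them."""
    path = Path(log_path)
    offset = 0  # remember how far into the file we have already read
    while True:
        if path.exists():
            with path.open("r") as f:
                f.seek(offset)
                for line in f:
                    forward(line.rstrip("\n"))
                offset = f.tell()
        # Note: this toy version does not handle log rotation or truncation.
        time.sleep(poll_seconds)

# tail_and_ship("/var/log/myapp/app.log")  # hypothetical log file
```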
API Calls: To get data from an API, your system needs to make a network request (often an HTTP GET request) to a specific URL endpoint provided by the API. The API then sends back the requested data, usually in a format like JSON. Collection might involve calling the API on a schedule (e.g., every hour) to get updates. Practical aspects include handling authentication (proving you have permission to access the API) and respecting rate limits (how many requests you can make in a given time).
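The sketch below polls a hypothetical JSON API with the widely used requests library, sending an API key for authentication and backing off when the server signals a rate limit (HTTP 429). The URL, header usage, and response shape are assumptions for the example.

```python
import time
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical endpoint
API_KEY = "your-api-key-here"                  # in practice, load from a secret store

def fetch_orders(updated_since: str) -> list:
    """Fetch records updated after a given timestamp, honoring rate limits."""
    params = {"updated_since": updated_since}
    headers = {"Authorization": f"Bearer {API_KEY}"}

    while True:
        response = requests.get(API_URL, params=params, headers=headers, timeout=30)
        if response.status_code == 429:
            # Rate limited: wait as long as the server suggests, then retry.
            wait_seconds = int(response.headers.get("Retry-After", "60"))
            time.sleep(wait_seconds)
            continue
        response.raise_for_status()  # fail loudly on other errors
        return response.json()       # assumes the API returns a JSON list
```

A scheduler (cron, Airflow, etc.) would then call `fetch_orders` every hour with the timestamp of the previous successful run.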
File Transfer / Ingestion: When data comes as files, collection might involve pulling them from an SFTP server, syncing them from cloud object storage (such as Amazon S3 or Google Cloud Storage), or watching a landing directory and loading new files as they appear, as sketched below.
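Here is a minimal sketch of the "watch a landing directory" pattern, assuming files are dropped into one folder and moved to another once processed. The directory names, the CSV file pattern, and the processing step are all placeholders.

```python
import shutil
import time
from pathlib import Path

LANDING_DIR = Path("data/landing")      # where new files arrive (hypothetical)
PROCESSED_DIR = Path("data/processed")  # where files go after ingestion

def process_file(path: Path) -> None:
    # Stand-in for real loading logic (parse, validate, write to storage).
    print(f"ingesting {path.name} ({path.stat().st_size} bytes)")

def watch_landing_dir(poll_seconds: float = 10.0) -> None:
    """Periodically pick up new CSV drops and move them aside once handled."""
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        for path in sorted(LANDING_DIR.glob("*.csv")):
            process_file(path)
            shutil.move(str(path), PROCESSED_DIR / path.name)  # avoid re-processing
        time.sleep(poll_seconds)
```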
Streaming Ingestion: Collecting data from streaming sources typically involves using message queues or dedicated streaming platforms (like Apache Kafka, Google Pub/Sub, or AWS Kinesis). Producers (the sources) send data continuously to these platforms, and consumers (your data processing applications) read from them in real-time or near real-time.
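The sketch below shows the producer/consumer pattern using kafka-python, one of several Kafka client libraries; the broker address and topic name are placeholders, and managed services like Pub/Sub or Kinesis expose analogous client APIs.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

TOPIC = "user-events"        # hypothetical topic name
BROKERS = "localhost:9092"   # placeholder broker address

# Producer side: the data source pushes events onto the stream.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "action": "page_view", "page": "/pricing"})
producer.flush()  # make sure buffered events are actually sent

# Consumer side: the data pipeline reads events as they arrive.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,  # stop iterating if nothing arrives for 10 seconds
)
for message in consumer:
    print(f"received event: {message.value}")
```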
Change Data Capture (CDC): Instead of repeatedly querying an entire database table to find changes, CDC techniques focus on capturing only the changes (inserts, updates, deletes) as they happen in the source database, often by reading the database's transaction log. This can be much more efficient for keeping downstream systems synchronized. While the implementation details can be complex, it's useful to know this approach exists.
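To make the idea tangible, the sketch below applies a stream of change events to a local copy of a table. The event shape (an operation type plus before/after row images) is a simplified, hypothetical one loosely modeled on what CDC tools emit; in practice a tool such as Debezium would produce these events by reading the source database's transaction log.

```python
def apply_change(table: dict, event: dict) -> None:
    """Apply one simplified change event to a local copy of a table."""
    op = event["op"]
    if op in ("insert", "update"):
        table[event["after"]["id"]] = event["after"]
    elif op == "delete":
        table.pop(event["before"]["id"], None)

customers = {}  # downstream copy of the source table, keyed by primary key
change_events = [
    {"op": "insert", "after": {"id": 1, "name": "Ada", "plan": "free"}},
    {"op": "update",
     "before": {"id": 1, "name": "Ada", "plan": "free"},
     "after": {"id": 1, "name": "Ada", "plan": "pro"}},
    {"op": "delete", "before": {"id": 1, "name": "Ada", "plan": "pro"}},
]
for event in change_events:
    apply_change(customers, event)
print(customers)  # {} after the insert, update, and delete have all been applied
```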
Web Scraping: This involves writing programs (scrapers) to automatically browse websites and extract information directly from HTML pages. While sometimes necessary if no API is available, it should be approached with caution. Websites change frequently, breaking scrapers, and scraping can raise ethical and legal questions or violate a site's terms of service. It's generally less reliable and less preferred than using a formal API.
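For completeness, here is what a very small scraper looks like, using requests and BeautifulSoup to pull heading text from a hypothetical page. Before running anything like this against a real site, check its terms of service and robots.txt, and expect the selectors to break when the page's HTML changes.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

URL = "https://example.com/news"  # placeholder page; check terms of service first

def scrape_headlines(url: str) -> list[str]:
    """Download a page and extract the text of its <h2> headings."""
    response = requests.get(
        url, timeout=30, headers={"User-Agent": "example-scraper/0.1"}
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumes headlines sit in <h2> tags; real pages need manual inspection.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# print(scrape_headlines(URL))
```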
Here's a diagram illustrating how different sources are typically accessed:
Different data sources often require specific collection methods. Databases are typically queried, logs are shipped, APIs are called, files are transferred, and streams are ingested.
Understanding these sources and collection methods is fundamental. When you need to build a data pipeline (which we'll cover in Chapter 3), your first questions will often be: "Where does the data live?" and "How can we get it?". Having this map of possibilities allows you to choose the right tools and techniques for the job. Next, we'll dive a bit deeper into one of the most common data storage structures: databases.