Alright, we've discussed how data gets processed, differentiating between batch jobs that run on schedules and stream processing that handles data continuously. But what actually does the processing? Transforming raw data, whether in large batches or real-time streams, requires computational power. Think of it like needing an engine to move a vehicle. Data processing needs its own kind of engine, which we refer to as compute resources.
At its heart, data processing involves calculations, data movement, and temporary storage. The primary components providing this power are:

- **CPU (Central Processing Unit):** carries out the actual calculations and transformations on your data. More cores, or faster cores, mean more work completed per second.
- **RAM (Random Access Memory):** holds the data currently being worked on. If a dataset doesn't fit in RAM, processing slows down dramatically as the system falls back to disk.
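To get a feel for these two resources, here is a minimal Python sketch that inspects the CPU and RAM available on the machine it runs on. It assumes the third-party `psutil` package is installed (`pip install psutil`); the standard library alone can report core counts but not memory.

```python
import os
import psutil  # third-party package; assumed installed for this sketch

# Number of logical CPU cores available for computation
print(f"CPU cores: {os.cpu_count()}")

# Snapshot of system memory, in bytes
mem = psutil.virtual_memory()
print(f"Total RAM:     {mem.total / 1e9:.1f} GB")
print(f"Available RAM: {mem.available / 1e9:.1f} GB")
```

Checks like this are a common first step when sizing a job: if your working dataset is larger than available RAM, you either need a bigger machine or a way to process the data in pieces.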
Sometimes, especially in machine learning or complex simulations (which often follow data engineering steps), another component becomes significant:

- **GPU (Graphics Processing Unit):** built for highly parallel numeric computation, a GPU can apply the same operation across large arrays of data far faster than a CPU, which is why it features heavily in model training.
Not all data processing tasks are created equal. Analyzing a small daily sales report requires far fewer resources than processing terabytes of sensor data from thousands of devices. This is where the idea of scaling comes in. There are two primary ways to increase your processing capacity:

- **Scaling up (vertical scaling):** move the work to a single more powerful machine, with a faster CPU and more RAM.
- **Scaling out (horizontal scaling):** distribute the work across multiple coordinated machines, each handling a share of the load.
A diagram showing two approaches to handling a workload: Scaling Up uses one powerful machine, while Scaling Out uses multiple coordinated machines.
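To make scaling out concrete, here is a small Python sketch that splits a dataset into chunks and processes them in parallel. The `transform` function and chunk size are illustrative placeholders; on a real cluster the workers would be separate machines coordinated by a distributed framework, but the principle of dividing work across independent workers is the same, demonstrated here with local processes.

```python
from multiprocessing import Pool

def transform(chunk):
    # Stand-in for a real transformation: square every value in the chunk
    return [x * x for x in chunk]

if __name__ == "__main__":
    # Split one large dataset into independent chunks
    data = list(range(1_000_000))
    chunk_size = 250_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # "Scale out" across 4 worker processes; in a real cluster these
    # workers would be separate machines rather than local processes
    with Pool(processes=4) as pool:
        results = pool.map(transform, chunks)

    total = sum(len(r) for r in results)
    print(f"Processed {total} records across {len(chunks)} chunks")
```

The key property that makes this work is that each chunk can be transformed independently; workloads with that shape scale out naturally, while tightly interdependent computations often cannot.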
Batch processing often benefits from scaling out, allowing massive datasets to be processed in parallel across many machines within a reasonable time frame. Stream processing may also scale out, adding workers so that processing keeps pace with the rate of incoming data.
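A back-of-the-envelope calculation shows why "within a reasonable time frame" usually means scaling out. The numbers below are illustrative assumptions, not benchmarks, and real jobs rarely parallelize perfectly because of coordination overhead and uneven data distribution:

```python
# Illustrative estimate of batch runtime (assumed, not measured, numbers)
data_size_tb = 10            # total data to process
per_machine_mb_per_s = 100   # assumed throughput of a single machine

data_size_mb = data_size_tb * 1_000_000
single_machine_hours = data_size_mb / per_machine_mb_per_s / 3600
print(f"1 machine:    {single_machine_hours:.1f} hours")   # ~27.8 hours

for machines in (10, 100):
    # Assumes near-perfect parallelism; real speedups are somewhat lower
    hours = single_machine_hours / machines
    print(f"{machines} machines: {hours:.2f} hours")
```

A job that takes more than a day on one machine drops to under three hours on ten machines, which is often the difference between a usable nightly pipeline and one that can never finish on schedule.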
Traditionally, organizations bought and maintained their own physical servers (known as on-premises infrastructure). Today, it's increasingly common to rent compute resources from cloud providers like Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.
Cloud platforms offer significant flexibility:

- **On-demand provisioning:** start new compute resources in minutes, rather than waiting weeks for hardware to be purchased and installed.
- **Elasticity:** add capacity for a large job, then release it when the job finishes.
- **Pay-as-you-go pricing:** pay for the resources you actually use instead of maintaining idle hardware sized for peak load.
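As a small illustration of on-demand provisioning, here is a sketch using AWS's `boto3` SDK to launch and then release a single virtual machine. It assumes AWS credentials are already configured, and the AMI ID shown is a placeholder rather than a real image; treat this as a sketch of the pattern, not a production script.

```python
import boto3  # AWS SDK for Python (pip install boto3); credentials assumed configured

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one virtual machine on demand; the AMI ID below is a placeholder --
# substitute an image ID that is valid in your region
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical machine image
    InstanceType="t3.large",           # 2 vCPUs, 8 GB RAM
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched instance {instance_id}")

# When the job is done, release the resource so you stop paying for it
ec2.terminate_instances(InstanceIds=[instance_id])
```

The terminate call at the end is the point of the example: because capacity can be released as easily as it is acquired, you only hold (and pay for) compute while work is actually running.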
We'll look more closely at cloud platforms in Chapter 6. For now, understand that processing data requires underlying compute resources (CPU, RAM), and you need ways to scale these resources appropriately for your specific tasks, whether using your own hardware or leveraging the cloud. Managing these resources efficiently is a core part of data engineering, ensuring that data can be processed reliably and cost-effectively.