Building a scalable data lake requires a specific set of structural design choices that differ significantly from those made for traditional relational database management systems. While databases typically bundle storage and processing power together, data lake architectures prioritize flexibility and cost-efficiency by breaking these components apart. This chapter establishes the technical groundwork for how these systems operate, from the physical layout of files to the logical organization of tables.
We begin by analyzing the separation of compute and storage. This design allows you to scale data retention independently of processing power, optimizing costs for workloads where storage and compute demands grow at different rates. You will examine the mechanics of this separation and why it has become the standard for cloud-based analytics.
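To make the decoupling concrete, the following is a minimal sketch using DuckDB as a stand-alone compute engine querying Parquet files that live in object storage. The bucket name and path are hypothetical, and the snippet assumes S3 credentials are already available in the environment; any engine that reads directly from object storage would illustrate the same point.

```python
import duckdb

# The compute engine runs on a local or ephemeral node;
# the data itself never leaves object storage.
con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension that enables s3:// paths
con.execute("LOAD httpfs")

# Hypothetical bucket and path. The engine scans Parquet directly
# from S3, so storage capacity and compute capacity scale separately.
rows = con.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM read_parquet('s3://example-data-lake/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()
print(rows)
```

Shutting this process down removes the compute cost entirely while the data remains in storage, which is the cost behavior this section examines in detail.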
Following the component structure, we address the specific behavior of object storage systems like Amazon S3, Azure Blob Storage, and Google Cloud Storage. These systems do not behave like the POSIX file systems found on local disks. We will discuss the implications of eventual consistency and object immutability, explaining why operations like renaming a directory are resource-intensive in a cloud environment.
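Because objects are immutable, these stores offer no in-place rename: a "rename" is a full server-side copy followed by a delete, repeated for every object under a prefix. The following is a minimal sketch with boto3; the bucket and key names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

def rename_object(old_key: str, new_key: str) -> None:
    """Emulate a rename by copying to the new key, then deleting the old one.

    The copy rewrites the entire object payload, so "renaming" a prefix
    containing N objects costs N copies plus N deletes. There is no
    metadata-only rename as in a POSIX file system.
    """
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": old_key},
        Key=new_key,
    )
    s3.delete_object(Bucket=BUCKET, Key=old_key)

rename_object("raw/2024/01/events.json", "staging/2024/01/events.json")
```

This cost profile is why data lake tools avoid directory renames and lean on metadata layers instead.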
The discussion then moves to logical data organization. We introduce the Medallion architecture, a layered approach that organizes data quality into three distinct stages: Bronze for raw ingested data, Silver for cleansed and conformed data, and Gold for curated, business-level aggregates. A minimal sketch of this flow appears below.
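As a rough illustration, here is a minimal PySpark sketch of data moving through the three layers. The paths, column names, and cleansing rules are assumptions for the example, not a prescribed implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw events as-is, preserving source fidelity.
bronze = spark.read.json("s3://example-data-lake/landing/events/")
bronze.write.mode("append").parquet("s3://example-data-lake/bronze/events/")

# Silver: cleanse and conform (deduplicate, enforce types).
silver = (
    spark.read.parquet("s3://example-data-lake/bronze/events/")
    .dropDuplicates(["event_id"])                       # assumed key column
    .withColumn("event_ts", F.to_timestamp("event_ts"))
)
silver.write.mode("overwrite").parquet("s3://example-data-lake/silver/events/")

# Gold: curated, business-level aggregates for consumption.
gold = silver.groupBy(F.to_date("event_ts").alias("event_date")).count()
gold.write.mode("overwrite").parquet(
    "s3://example-data-lake/gold/daily_event_counts/"
)
```

Each layer reads only from the one before it, so quality guarantees strengthen as data moves toward Gold.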
Finally, we compare the Lambda and Kappa processing architectures. You will evaluate the trade-offs between maintaining separate batch and speed layers versus implementing a single stream-processing path. By the end of this chapter, you will be able to select the appropriate structural patterns for specific data engineering requirements.
1.1 Decoupling Compute and Storage
1.2 Object Storage Semantics
1.3 The Medallion Architecture
1.4 Lambda and Kappa Architectures