Data lakes are built on top of object storage, where data exists as immutable files rather than mutable database rows. While the architectural separation of compute and storage provides scalability, the physical layout of these files dictates query latency and cost. If a query engine must parse terabytes of text-based logs to calculate a simple average, most of that work is wasted I/O and compute. This chapter focuses on the storage-layer mechanics that optimize these read patterns.
We begin by distinguishing between row-oriented formats, such as CSV and Avro, and columnar formats like Apache Parquet and ORC. You will examine why analytical workloads, which typically aggregate specific metrics across massive datasets, perform better when data is stored by column rather than by row. The chapter also covers the internal structure of Apache Parquet, detailing how encoding schemes like Run-Length Encoding (RLE) and dictionary encoding reduce the storage footprint.
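As a preview of that material, the sketch below uses the pyarrow library (an assumption of this example; the file name and column names are purely illustrative) to write a Parquet file with dictionary encoding enabled and then inspect which encodings the writer applied to a low-cardinality column.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table with a low-cardinality string column, which dictionary
# encoding and RLE compress well, next to a numeric metric column.
table = pa.table({
    "country": ["US", "US", "DE", "US", "DE"] * 20_000,
    "latency_ms": pa.array(range(100_000), type=pa.int32()),
})

# Dictionary encoding is enabled by default in pyarrow; the flag is
# shown explicitly here for emphasis.
pq.write_table(table, "metrics.parquet", use_dictionary=True)

# Inspect the column chunk metadata for the "country" column.
meta = pq.ParquetFile("metrics.parquet").metadata
country_chunk = meta.row_group(0).column(0)
print(country_chunk.encodings)
print(country_chunk.total_compressed_size)
```

Because the country column repeats only a handful of distinct values, the writer stores a small dictionary plus compact integer indices, which is far cheaper than repeating the raw strings row by row as a CSV file would.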
The discussion then moves to Open Table Formats. Standard file collections lack the transaction guarantees found in traditional databases. We examine how specifications like Apache Iceberg, Delta Lake, and Apache Hudi introduce a metadata layer to support ACID transactions and snapshot isolation. You will also learn to select appropriate compression algorithms, balancing the CPU cost of decompression against the network savings of smaller file sizes.
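To make the idea of a metadata layer concrete, here is a minimal sketch assuming the deltalake Python package (the delta-rs bindings); the table path and data are hypothetical. Each write produces a new committed version, and earlier snapshots remain readable.

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

batch = pa.table({"order_id": [1, 2], "amount": [9.5, 3.0]})

# Each successful write is an atomic commit recorded in the
# table's transaction log.
write_deltalake("/tmp/orders", batch)                  # creates version 0
write_deltalake("/tmp/orders", batch, mode="append")   # commits version 1

dt = DeltaTable("/tmp/orders")
print(dt.version())        # latest committed snapshot

# Snapshot isolation: a reader can pin an earlier version and see
# a consistent view of the table as of that commit.
v0 = DeltaTable("/tmp/orders", version=0)
print(v0.to_pyarrow_table().num_rows)
```

Apache Iceberg and Apache Hudi expose the same ideas through their own metadata layers, which the chapter compares in more detail.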
Finally, we address partitioning strategies. An effective partition scheme allows the query engine to target only relevant data subsets, a process known as partition pruning. For instance, a query that filters on the partitioning keys can skip scanning the majority of the lake if the directory structure reflects those keys. By the end of this chapter, you will be able to configure file layouts that minimize I/O and maximize throughput.
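As a small illustration using pyarrow's dataset API (the event_date and region columns and the lake directory are hypothetical), Hive-style partitioning encodes the filter keys in the directory names, so a filter on those keys opens only the matching directories:

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["eu", "us", "eu"],
    "clicks": [10, 7, 3],
})

# Write one directory per (event_date, region) combination,
# e.g. lake/event_date=2024-01-01/region=eu/...
ds.write_dataset(
    table,
    "lake",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("event_date", pa.string()), ("region", pa.string())]),
        flavor="hive",
    ),
)

# The filter is resolved against the directory names, so only
# matching partitions are read; the rest of the lake is skipped.
dataset = ds.dataset("lake", format="parquet", partitioning="hive")
pruned = dataset.to_table(
    filter=(ds.field("event_date") == "2024-01-01") & (ds.field("region") == "eu")
)
print(pruned.num_rows)
```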
2.1 Row-Oriented vs Columnar Storage
2.2 Apache Parquet Internals
2.3 Open Table Formats
2.4 Compression Algorithms
2.5 Partitioning Strategies
2.6 Hands-on Practical: Optimizing File Layouts