After establishing storage layouts and ingestion pipelines, the architecture must support efficient data retrieval. Storing data in optimized formats like Parquet or Avro provides the foundation, but the performance of analytical workloads depends heavily on how the query engine interacts with the object storage layer. Unlike tightly coupled databases where the engine controls the physical disk, data lakes require the query planner to navigate network latency and distributed file systems.
This section examines the mechanisms used by distributed SQL engines, such as Trino and Apache Spark, to execute high-performance queries on decoupled storage. We analyze how Massively Parallel Processing (MPP) architectures break down SQL statements into tasks distributed across a cluster of worker nodes.
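The task-distribution idea can be seen in a minimal sketch. The `ScanTask` and `plan_scan` names below are illustrative, not part of Trino's or Spark's actual APIs; real engines also weigh data locality, file size, and worker load rather than assigning round-robin.

```python
from dataclasses import dataclass

@dataclass
class ScanTask:
    file_path: str  # object-store key of one data file
    worker: str     # node assigned to read it

def plan_scan(files, workers):
    """Round-robin assignment of per-file scan tasks across worker nodes."""
    return [ScanTask(f, workers[i % len(workers)])
            for i, f in enumerate(files)]

files = [f"s3://lake/sales/part-{i:04d}.parquet" for i in range(6)]
tasks = plan_scan(files, ["worker-1", "worker-2", "worker-3"])
for t in tasks:
    print(t.worker, "->", t.file_path)
```

Each worker then reads only its assigned files directly from object storage, which is how the engine parallelizes a single table scan across the cluster.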
You will learn technical strategies to minimize I/O operations, which are the primary cost driver in cloud analytics. We discuss partition pruning and file skipping, where the engine uses metadata statistics (min, max, and null counts) to ignore irrelevant files completely. For example, without pruning, a query scans the full dataset, with cost proportional to the total data volume, or O(n). By applying metadata filters, the engine limits the scan to relevant partitions, significantly reducing the input size.
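File skipping can be sketched with a few lines of Python. The `file_stats` dictionary below stands in for per-file min/max statistics that a real engine would read from Parquet footers or a table-format manifest; the values are hypothetical.

```python
# Hypothetical footer statistics for three files of an order_date column.
file_stats = {
    "part-0000.parquet": {"min": "2024-01-01", "max": "2024-01-31"},
    "part-0001.parquet": {"min": "2024-02-01", "max": "2024-02-29"},
    "part-0002.parquet": {"min": "2024-03-01", "max": "2024-03-31"},
}

def files_to_scan(stats, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate range;
    all other files are skipped without ever being read."""
    return [f for f, s in stats.items()
            if not (s["max"] < lo or s["min"] > hi)]

# Predicate: WHERE order_date BETWEEN '2024-02-10' AND '2024-02-20'
print(files_to_scan(file_stats, "2024-02-10", "2024-02-20"))
```

Only the February file overlaps the predicate range, so two of the three files are excluded before any data bytes are fetched from object storage.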
The curriculum also covers reading EXPLAIN output to diagnose bottlenecks and verify that partition schemes are functioning as intended.

5.1 Distributed Query Engines
5.2 File Pruning and Skipping
5.3 Vectorized Query Execution
5.4 Caching Strategies
5.5 Hands-on Practical: Query Analysis