The performance of a query is often determined less by how fast you can process data and more by how much data you can avoid processing entirely. Data transfer over the network is the primary bottleneck in decoupled storage architectures. Every byte read from object storage incurs latency and cost. Therefore, the most effective optimization strategy is to ensure the query engine only retrieves the specific files containing the required rows.
Two primary mechanisms achieve this reduction: Partition Pruning and File Skipping. While they share the same goal of reducing I/O, they operate at different granularities and stages of the query planning process.
Partition pruning is the first line of defense. It operates at the directory level (or the logical partition level in a metastore). When you define a table partitioned by a specific column, such as date or region, the physical data is organized into hierarchical directories.
When a user submits a query with a WHERE clause filtering on a partition column, the query planner inspects the directory structure or the partition list in the catalog. It immediately discards any paths that do not match the filter criteria. This operation happens during the planning phase, often before any data files are listed or touched on the storage layer.
Consider a dataset partitioned by year and month. If a query requests data for January 2023, the engine effectively ignores all other years and months.
Figure: the query engine isolates the target path year=2023/month=01, bypassing unrelated directories entirely to reduce metadata overhead and scanning.
In this scenario, the engine does not need to list the files in year=2022. This reduces the number of API calls to the object store (such as S3 LIST requests) and prevents the engine from scheduling tasks for irrelevant data.
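To make this concrete, here is a minimal PySpark sketch; the bucket path and table layout are illustrative, not taken from a specific dataset. Because the predicate touches only partition columns, the planner can satisfy it against the directory structure before scheduling any file reads:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Hypothetical layout: s3://analytics-bucket/events/year=2023/month=01/part-*.parquet
events = spark.read.parquet("s3://analytics-bucket/events/")

# The predicate references only partition columns, so the planner resolves it
# against the directory structure and schedules scans solely for year=2023/month=01.
jan_2023 = events.where("year = 2023 AND month = 1")

# The PartitionFilters entry in the physical plan shows which partitions survived pruning.
jan_2023.explain()
```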
Partition pruning is coarse-grained. It works well for high-level filters but fails to help when queries filter on non-partition columns. For example, if you query for customer_id = 105 inside the January 2023 partition, partition pruning will narrow the scope to that month, but the engine might still have to scan hundreds of files within that folder.
This is where File Skipping (also known as Data Skipping) becomes critical. This technique uses metadata statistics stored within the file footers (in Parquet or ORC) or in the table manifest (in Iceberg or Delta Lake).
The most common statistics used for skipping are:

- Minimum and maximum values for each column (per file or per row group)
- Null counts for each column
- Row counts (the number of records in the file or row group)
When a query engine evaluates a file, it looks at the WHERE clause and compares it against the file's metadata. It applies a logical test to see if the file could possibly contain the data.
For a query filtering on WHERE price > 100:

- If a file's Max value for price is 90, no row in that file can satisfy the predicate, so the file is skipped without being downloaded.
- If a file's Max value for price is 150, the file might contain matching rows, so it stays in the scan list.
The engine performs this check using negligible CPU resources compared to the cost of downloading and decompressing the file.
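The same min/max test can be expressed directly against Parquet footer metadata. A minimal sketch using pyarrow; the file names and the price column are illustrative:

```python
import pyarrow.parquet as pq

def may_contain(path: str, column: str, threshold: float) -> bool:
    """Return True if some row group's [min, max] range could satisfy
    `column > threshold`; return False only when the file is safely skippable."""
    metadata = pq.ParquetFile(path).metadata
    for rg in range(metadata.num_row_groups):
        for col in range(metadata.num_columns):
            chunk = metadata.row_group(rg).column(col)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            # Missing statistics: be conservative and keep the file.
            if stats is None or not stats.has_min_max:
                return True
            if stats.max > threshold:
                return True
    return False

# Plan only the files whose statistics admit a match for price > 100.
candidate_files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]
files_to_scan = [f for f in candidate_files if may_contain(f, "price", 100.0)]
```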
Visualization of file ranges. The red line represents a query value. Only files with ranges (blue bars) intersecting the query value are scanned; others (grey bars) are skipped.
File skipping relies entirely on the data distribution. If your data is randomly distributed across files, the Min/Max ranges for every file will effectively cover the entire domain of values. For instance, if customer_ids from 1 to 1,000,000 are randomly scattered, every file might have a Min near 1 and a Max near 1,000,000. In this case, no files can be skipped because every file might contain the requested ID.
To maximize the efficiency of file skipping, data engineers must cluster or sort data during ingestion.
- Sorting: If you sort the data by a frequently filtered column (such as customer_id) before writing, you ensure that File 1 contains IDs 1–1000, File 2 contains 1001–2000, and so on. This creates distinct, non-overlapping ranges (a minimal sketch follows this list).
- Z-Ordering: When queries filter on multiple columns (such as customer_id AND transaction_date), simple linear sorting is insufficient. Z-Ordering (or Hilbert curves) is a technique used by modern formats like Delta Lake and Iceberg to colocate related information in multidimensional space, allowing efficient skipping on multiple attributes.
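Here is a minimal sketch of the sorting effect, assuming a single customer_id column split across four files; the paths and sizes are illustrative. Writing the same values first shuffled and then sorted shows how sorting turns overlapping Min/Max ranges into disjoint ones:

```python
import random
import pyarrow as pa
import pyarrow.parquet as pq

ids = list(range(1, 1001))
random.shuffle(ids)

def write_files(values, prefix):
    """Write four 250-row files and print each file's Min/Max statistics."""
    for i in range(4):
        chunk = values[i * 250:(i + 1) * 250]
        path = f"{prefix}-{i}.parquet"
        pq.write_table(pa.table({"customer_id": chunk}), path)
        stats = pq.ParquetFile(path).metadata.row_group(0).column(0).statistics
        print(path, stats.min, stats.max)

write_files(ids, "random")          # every file spans roughly 1..1000: nothing is skippable
write_files(sorted(ids), "sorted")  # disjoint ranges such as 1..250, 251..500, ...
```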
Legacy data lake approaches required the query engine to open the footer of every single Parquet file to read these statistics. While better than reading the whole file, this still requires many GET requests to object storage just to read headers.

Modern Open Table Formats (OTFs) like Apache Iceberg improve on this by elevating statistics to the manifest level. A manifest is a metadata file that lists the data files along with their column-level upper and lower bounds.
When using an OTF, the query engine reads the manifest file first. It performs the filtering logic purely on the metadata in memory. It produces a list of target files and only then communicates with the object storage to retrieve the actual data. This separation decouples the planning phase from the storage layer, significantly reducing latency for large tables.
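To make the planning step concrete, here is a minimal sketch of that in-memory filtering; the manifest entries are hypothetical stand-ins for the per-file bounds an engine would read from an Iceberg manifest:

```python
# Hypothetical per-file bounds, as an engine might hold them in memory
# after reading a manifest (paths and values are illustrative).
manifest_entries = [
    {"path": "data/f1.parquet", "price_min": 5.0,   "price_max": 80.0},
    {"path": "data/f2.parquet", "price_min": 60.0,  "price_max": 240.0},
    {"path": "data/f3.parquet", "price_min": 300.0, "price_max": 950.0},
]

# Planning for `price > 100` happens purely on metadata; only the surviving
# files are requested from object storage afterwards.
files_to_read = [e["path"] for e in manifest_entries if e["price_max"] > 100.0]
print(files_to_read)  # ['data/f2.parquet', 'data/f3.parquet']
```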
We can estimate the impact of pruning and skipping using a simplified cost model. Let $D$ be the total data volume and $S$ be the scan throughput of the engine.

Without pruning, the query time is roughly linear in the dataset size:

$$T_{\text{full}} = \frac{D}{S}$$

With partition pruning (filtering down to 10% of the data) and file skipping (skipping 80% of the remaining files based on statistics), the effective scanned volume becomes:

$$D_{\text{effective}} = D \times 0.10 \times (1 - 0.80) = 0.02\,D, \qquad T_{\text{pruned}} = \frac{0.02\,D}{S} = 0.02\,T_{\text{full}}$$
The query now runs 50 times faster, not because the engine is faster, but because it processes 98% less data. This reduction directly translates to lower cloud costs and faster insights.
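For example, with illustrative numbers of $D = 10$ TB and $S = 1$ GB/s, a full scan takes roughly $10{,}000$ seconds, while the pruned and skipped query reads only $0.02 \times 10$ TB $= 200$ GB and finishes in about $200$ seconds.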