The performance of a query is often determined less by how fast you can process data and more by how much data you can avoid processing entirely. Data transfer over the network is the primary bottleneck in decoupled storage architectures: every byte read from object storage incurs latency and cost. The most effective optimization strategy is therefore to ensure the query engine retrieves only the specific files containing the required rows.

Two primary mechanisms achieve this reduction: partition pruning and file skipping. While they share the goal of reducing I/O, they operate at different granularities and at different stages of query planning.

## Partition Pruning

Partition pruning is the first line of defense. It operates at the directory level (or the logical partition level in a metastore). When you define a table partitioned by a specific column, such as date or region, the physical data is organized into hierarchical directories.

When a user submits a query with a `WHERE` clause filtering on a partition column, the query planner inspects the directory structure or the partition list in the catalog and immediately discards any paths that do not match the filter criteria. This happens during the planning phase, often before any data files are listed or touched on the storage layer.

Consider a dataset partitioned by year and month.
If a query requests data for January 2023, the engine effectively ignores all other years and months.

```dot
digraph G {
  rankdir=TB;
  node [fontname="Helvetica,Arial,sans-serif", style=filled, color="#adb5bd"];
  edge [color="#868e96"];
  subgraph cluster_0 {
    label = "";
    color = white;
    root  [label="s3://data/sales/", shape=folder, fillcolor="#e9ecef"];
    y2022 [label="year=2022", shape=folder, fillcolor="#dee2e6", fontcolor="#868e96"];
    y2023 [label="year=2023", shape=folder, fillcolor="#a5d8ff", penwidth=2, color="#339af0"];
    y2024 [label="year=2024", shape=folder, fillcolor="#dee2e6", fontcolor="#868e96"];
    m01   [label="month=01", shape=folder, fillcolor="#a5d8ff", penwidth=2, color="#339af0"];
    m02   [label="month=02", shape=folder, fillcolor="#dee2e6", fontcolor="#868e96"];
    f1    [label="data.parquet", shape=note, fillcolor="#4dabf7", fontcolor="white"];
    root -> y2022 [style=dashed];
    root -> y2023 [color="#228be6", penwidth=2];
    root -> y2024 [style=dashed];
    y2023 -> m01 [color="#228be6", penwidth=2];
    y2023 -> m02 [style=dashed];
    m01 -> f1 [color="#228be6", penwidth=2];
  }
}
```

The query engine isolates the target path `year=2023/month=01`, bypassing unrelated directories entirely to reduce metadata overhead and scanning.

In this scenario, the engine does not need to list the files in `year=2022`. This reduces the number of API calls to the object store (such as S3 LIST requests) and prevents the engine from scheduling tasks for irrelevant data.

## File Skipping and Statistics

Partition pruning is coarse-grained. It works well for high-level filters but does not help when queries filter on non-partition columns. For example, if you query for `customer_id = 105` inside the January 2023 partition, pruning narrows the scope to that month, but the engine might still have to scan hundreds of files within that folder.

This is where file skipping (also known as data skipping) becomes critical.
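At its core, the skipping decision is a simple interval-overlap test against per-file column statistics. A minimal Python sketch, where the file names and min/max values are purely illustrative:

```python
# Decide which files could contain rows matching a point predicate,
# using per-file min/max statistics (hypothetical values, not a real table).

def overlaps(file_min, file_max, lo, hi):
    """True if the file's [file_min, file_max] range intersects the query range [lo, hi]."""
    return file_max >= lo and file_min <= hi

# Hypothetical per-file stats for the customer_id column.
file_stats = {
    "part-000.parquet": (1, 5_000),
    "part-001.parquet": (5_001, 10_000),
    "part-002.parquet": (10_001, 15_000),
}

# Query: WHERE customer_id = 105  ->  point range [105, 105]
to_read = [name for name, (lo, hi) in file_stats.items()
           if overlaps(lo, hi, 105, 105)]
print(to_read)  # only part-000.parquet can contain customer_id = 105
```

The test is conservative: a file that passes might still contain no matching rows, but a file that fails is guaranteed safe to skip.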
This technique uses metadata statistics stored within the file footers (in Parquet or ORC) or in the table manifest (in Iceberg or Delta Lake).

The most common statistics used for skipping are:

- Minimum value: the smallest value for a column in the file (or row group).
- Maximum value: the largest value for a column in the file.
- Null count: whether the column contains null values.

### How Skipping Works

When a query engine evaluates a file, it compares the `WHERE` clause against the file's metadata, applying a logical test to see whether the file could possibly contain matching data.

For a query filtering on `WHERE price > 100`:

- File A (min: 10, max: 50): the range [10, 50] does not overlap with > 100. Skip.
- File B (min: 40, max: 120): the range [40, 120] overlaps with > 100. Read.
- File C (min: 150, max: 200): the range [150, 200] is fully contained. Read.

The engine performs this check using negligible CPU resources compared to the cost of downloading and decompressing the file.

```json
{
  "layout": {
    "title": "File Skipping Logic based on Column Ranges",
    "xaxis": {"title": "Column Value Range", "showgrid": true, "zeroline": false},
    "yaxis": {"showticklabels": false, "title": "Files"},
    "shapes": [
      {"type": "line", "x0": 65, "x1": 65, "y0": 0, "y1": 1,
       "xref": "x", "yref": "paper",
       "line": {"color": "#fa5252", "width": 4, "dash": "dot"}}
    ],
    "annotations": [
      {"x": 65, "y": 1, "xref": "x", "yref": "paper",
       "text": "Query: value = 65", "showarrow": true, "arrowhead": 2,
       "ax": 0, "ay": -30, "font": {"color": "#fa5252"}}
    ],
    "margin": {"t": 40, "l": 50, "r": 20, "b": 40},
    "height": 300
  },
  "data": [
    {"type": "bar",
     "y": ["File 1", "File 2", "File 3"],
     "x": [20, 30, 20], "base": [10, 40, 80],
     "orientation": "h",
     "marker": {"color": ["#dee2e6", "#339af0", "#dee2e6"]},
     "text": ["Range: 10-30 (Skip)", "Range: 40-70 (Scan)", "Range: 80-100 (Skip)"],
     "textposition": "auto", "hoverinfo": "text"}
  ]
}
```

Visualization of file ranges. The red line represents a query value.
Only files with ranges (blue bars) intersecting the query value are scanned; the others (grey bars) are skipped.

## The Importance of Data Layout

File skipping relies entirely on the data distribution. If your data is randomly distributed across files, the min/max ranges for every file will effectively cover the entire domain of values. For instance, if customer_ids from 1 to 1,000,000 are randomly scattered, every file might have a min near 1 and a max near 1,000,000. In that case, no files can be skipped, because every file might contain the requested ID.

To maximize the efficiency of file skipping, data engineers must cluster or sort data during ingestion.

- Sorting: By sorting data on frequently queried columns (like `customer_id`) before writing, you ensure that file 1 contains IDs 1–1000, file 2 contains IDs 1001–2000, and so on. This creates distinct, non-overlapping ranges.
- Z-Ordering: For skipping on multiple columns (e.g., `customer_id` AND `transaction_date`), simple linear sorting is insufficient. Z-ordering (or Hilbert curves), used by modern formats such as Delta Lake and Iceberg, colocates related values in multidimensional space, allowing efficient skipping on multiple attributes.

## Skipping in Open Table Formats

Legacy data lake approaches required the query engine to open the footer of every single Parquet file to read these statistics. While better than reading the whole file, this still requires many GET requests to object storage just to read headers.

Modern open table formats (OTFs) like Apache Iceberg improve on this by elevating statistics to the manifest level. The manifest is a metadata file that lists the data files along with their per-column upper and lower bounds.

When using an OTF, the query engine reads the manifest file first and performs the filtering logic purely on the metadata in memory. It produces a list of target files and only then communicates with the object storage to retrieve the actual data.
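The manifest-first flow can be sketched as pure in-memory filtering. The manifest structure below is a deliberate simplification of what a real format like Iceberg stores, with made-up paths and bounds:

```python
# Sketch of manifest-level pruning: filter file entries in memory using
# the column bounds recorded in a (simplified, hypothetical) manifest,
# then fetch only the surviving files from object storage.

manifest = [
    {"path": "s3://data/sales/f1.parquet", "price_min": 10,  "price_max": 50},
    {"path": "s3://data/sales/f2.parquet", "price_min": 40,  "price_max": 120},
    {"path": "s3://data/sales/f3.parquet", "price_min": 150, "price_max": 200},
]

def plan_scan(manifest, lower_bound):
    """Return the files whose max value can satisfy 'price > lower_bound'."""
    return [entry["path"] for entry in manifest
            if entry["price_max"] > lower_bound]

# Planning happens entirely in memory; no GET requests have been issued yet.
targets = plan_scan(manifest, lower_bound=100)
print(targets)  # f2 and f3 survive; f1 is skipped without touching storage
```

Only after this list is produced does the engine issue reads for the surviving files, which is what keeps planning cheap even on very large tables.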
This separation decouples the planning phase from the storage layer, significantly reducing latency for large tables.

## Efficiency Math

We can estimate the impact of pruning and skipping using a simplified cost model. Let $N$ be the total data volume and $S$ be the scan throughput of the engine.

Without pruning, the query time $T$ is roughly linear in the dataset size:

$$T \approx \frac{N}{S}$$

With partition pruning (filtering down to 10% of the data) and file skipping (skipping 80% of the remaining files based on stats), the effective scanned volume $N_{\text{effective}}$ becomes:

$$N_{\text{effective}} = N \times 0.10 \times 0.20 = 0.02N$$

The query now runs 50 times faster, not because the engine is faster, but because it processes 98% less data. This reduction translates directly into lower cloud costs and faster insights.
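As a quick sanity check, the model can be evaluated directly. The 1,000 GB volume and 1 GB/s throughput are made-up illustrative numbers, not benchmarks:

```python
# Worked example of the cost model T ≈ N / S, with pruning and skipping
# shrinking N. Dataset size and throughput are hypothetical.

N = 1_000  # total data volume in GB (illustrative)
S = 1.0    # scan throughput in GB/s (illustrative)

t_full = N / S  # full-scan time in seconds

partition_fraction = 0.10  # partition pruning keeps 10% of the data
skip_fraction = 0.20       # file skipping keeps 20% of the remainder

n_effective = N * partition_fraction * skip_fraction  # 2% of N
t_pruned = n_effective / S

print(f"scan {n_effective:.0f} GB instead of {N} GB")
print(f"speedup: {t_full / t_pruned:.0f}x")
```

Changing either fraction shows how sensitive the speedup is to data layout: halving the skip rate alone doubles the scanned volume.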