Writing efficient SQL queries for data lakes requires a mental shift from traditional database optimization. Relational databases handle indexing and physical retrieval transparently through their storage engines. In a decoupled data lake architecture, queries must be explicitly designed to align with the underlying file organization, minimizing network I/O and decoding costs.

This practical section focuses on diagnosing query performance issues using execution plans. We will examine a common scenario: a slow aggregation query running on a large dataset. You will learn how to generate an execution plan, identify full table scans, and rewrite the query to take advantage of partition pruning.

## The Scenario

Imagine a dataset of web server access logs stored in an object store (S3 or Azure Blob Storage). The data is stored in Apache Parquet format and follows a standard partitioning scheme based on the date of the event.

- Table Name: `access_logs`
- Format: Parquet
- Partition Strategy: `year` / `month` / `day`
- Total Size: 500 TB
- File Count: 200,000 objects

A data analyst submits the following query to calculate hit counts per URL during October 2023:

```sql
SELECT url, COUNT(*) AS hit_count
FROM access_logs
WHERE event_timestamp >= TIMESTAMP '2023-10-01 00:00:00'
  AND event_timestamp < TIMESTAMP '2023-11-01 00:00:00'
GROUP BY url;
```

Despite filtering for only one month of data, this query takes over 45 minutes to execute. To understand why, we need to inspect how the query engine translates this SQL into physical actions.

## Generating and Reading an Execution Plan

Most distributed query engines, including Trino (formerly PrestoSQL) and Apache Spark SQL, provide an EXPLAIN command. This command returns the logical or physical plan that the engine intends to execute.
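Engines differ in how they render their plans, but the skill of walking the tree and spotting an unpruned scan transfers between them. As an engine-agnostic sketch — the `PlanNode` class and `find_full_scans` helper below are invented for illustration and are not a Trino or Spark API — this is how a plan tree can be inspected programmatically:

```python
# Toy illustration (not any engine's real API): walk a plan tree and flag
# table scans whose description indicates that no partitions were pruned.
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    name: str                      # operator name, e.g. "TableScan", "Filter"
    detail: str = ""               # free-form operator description
    children: list = field(default_factory=list)

def find_full_scans(node):
    """Recursively collect TableScan nodes that read all partitions."""
    hits = []
    if node.name == "TableScan" and "all partitions" in node.detail:
        hits.append(node)
    for child in node.children:
        hits.extend(find_full_scans(child))
    return hits

# A simplified plan for the slow query; the scan sits at the bottom.
plan = PlanNode("Output", children=[
    PlanNode("Exchange", "Gather Results", children=[
        PlanNode("Aggregation", "Group By (url)", children=[
            PlanNode("Filter", "event_timestamp >= ...", children=[
                PlanNode("TableScan", "access_logs (scans all partitions)"),
            ]),
        ]),
    ]),
])

for scan in find_full_scans(plan):
    print(f"Full scan detected: {scan.detail}")
```

In a real workflow you would read the text that EXPLAIN prints rather than build the tree by hand; the point is that the diagnosis always starts at the scan node at the bottom of the tree.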
It does not run the query but outlines the steps involved. To see the plan, prepend the keyword to the statement:

```sql
EXPLAIN SELECT url, COUNT(*) FROM access_logs ...
```

The output is typically a tree structure representing the flow of data from the bottom (storage) to the top (result). Here is a simplified representation of the plan for the slow query above:

```dot
digraph G {
  rankdir=TB;
  node [shape=box, style=filled, fontname="Arial", fontsize=10];
  edge [fontname="Arial", fontsize=10];
  Output [label="Output: Result Set", fillcolor="#b2f2bb"];
  Exchange [label="Exchange: Gather Results", fillcolor="#a5d8ff"];
  Aggregation [label="Aggregation: Group By (url)", fillcolor="#a5d8ff"];
  Filter [label="Filter: event_timestamp >= ...", fillcolor="#ffc9c9"];
  Scan [label="TableScan: access_logs\n(scans all partitions)", fillcolor="#ffc9c9"];
  Output -> Exchange;
  Exchange -> Aggregation;
  Aggregation -> Filter;
  Filter -> Scan;
}
```

The execution flow moves from bottom to top. The red nodes indicate expensive operations where the engine scans data before filtering it.

In the diagram above, the critical component is the TableScan at the bottom.

- **TableScan:** The engine lists all files in the `access_logs` table.
- **Filter:** The engine opens every Parquet file, reads the `event_timestamp` column, and checks whether the value falls within October.
- **Aggregation:** Matching rows are passed up for counting.

The performance issue lies in the mismatch between the schema and the query. The table is partitioned by `year`, `month`, and `day`, but the query filters on `event_timestamp`. While these columns are semantically related, the query engine does not automatically know that `event_timestamp` correlates perfectly with the directory structure. Consequently, it defaults to a full table scan, reading file metadata and footers across the entire 500 TB table to find the relevant rows.

## Implementing Partition Pruning

To optimize this, we must explicitly trigger partition pruning.
This occurs when the query engine uses the directory structure (e.g., `year=2023/month=10`) to skip files entirely, rather than filtering rows after reading them. We rewrite the query to include filters on the partition columns:

```sql
SELECT url, COUNT(*) AS hit_count
FROM access_logs
WHERE year = '2023'
  AND month = '10'
  AND event_timestamp >= TIMESTAMP '2023-10-01 00:00:00'
GROUP BY url;
```

When we run EXPLAIN on this revised query, the TableScan node changes significantly. The engine consults the metadata catalog, identifies that only directories satisfying `year=2023` and `month=10` are required, and ignores the rest.

If the dataset spans 10 years, filtering down to one month reduces the input data volume by approximately a factor of 120:

$$ \text{Input Size}_{\text{optimized}} \approx \frac{\text{Total Size}}{120} $$

## Quantifying the Impact

The difference between a scanned filter and a pruned filter is often the difference between a failed query and a sub-minute response. The following chart compares the data scanned and execution time for the two approaches.

```json
{
  "layout": {
    "title": "Impact of Partition Pruning on Query Performance",
    "xaxis": {"title": "Metric"},
    "yaxis": {"title": "Value (Log Scale)", "type": "log"},
    "barmode": "group",
    "plot_bgcolor": "#f8f9fa",
    "paper_bgcolor": "#f8f9fa",
    "font": {"family": "Arial, sans-serif"}
  },
  "data": [
    {"x": ["Data Scanned (GB)", "Execution Time (Seconds)"], "y": [512000, 2700], "name": "Full Scan (Unpruned)", "type": "bar", "marker": {"color": "#ff8787"}},
    {"x": ["Data Scanned (GB)", "Execution Time (Seconds)"], "y": [4200, 45], "name": "Partition Pruned", "type": "bar", "marker": {"color": "#37b24d"}}
  ]
}
```

Comparing the resource consumption of a full scan versus a pruned scan. Note the logarithmic scale on the Y-axis; the reduction spans two orders of magnitude.

## Analyzing the Physical Plan Statistics

Modern engines often provide an EXPLAIN ANALYZE command (Trino) or a UI (Spark History Server) that shows actual runtime statistics.
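The pruning decision itself is simple enough to sketch. Assuming Hive-style `key=value` directory names — the file paths below are invented for illustration — the engine only has to parse partition values out of each path and discard non-matching directories before opening a single file:

```python
# Hypothetical sketch of partition pruning over Hive-style paths.
# The listing below is made up; a real engine gets it from the catalog.
def partition_values(path):
    """Extract {'year': '2023', 'month': '10', ...} from a Hive-style path."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)

def prune(paths, predicate):
    """Keep only the files whose partition values satisfy the predicate."""
    return [p for p in paths if predicate(partition_values(p))]

paths = [
    "access_logs/year=2023/month=09/day=30/part-00.parquet",
    "access_logs/year=2023/month=10/day=01/part-00.parquet",
    "access_logs/year=2023/month=10/day=02/part-00.parquet",
    "access_logs/year=2022/month=10/day=01/part-00.parquet",
]

october_2023 = prune(paths, lambda v: v["year"] == "2023" and v["month"] == "10")
print(october_2023)  # only the two year=2023/month=10 files survive

# Back-of-the-envelope reduction from the text: 10 years of monthly
# partitions is 120 months, so pruning to one month divides input by ~120.
print(f"{500 / 120:.1f} TB")  # ~4.2 TB scanned instead of 500 TB
```

Note that the predicate runs against directory names only; the Parquet files under the skipped directories contribute no I/O at all.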
When reviewing these detailed plans, focus on three specific metrics:

1. **Input Rows vs. Output Rows:** Look at the Filter operator. If the input is 1 billion rows and the output is 1,000 rows, your filter is highly selective. However, if this filtering happens after the scan, you have paid the I/O cost for 1 billion rows. The goal is to push this reduction down to the Scan level via partitions.
2. **Splits or Tasks:** A "split" represents a unit of work, typically a single file or a block of a file assigned to a worker node.
   - *High split count, low data volume:* indicates the Small File Problem. Reading 10,000 files of 1 MB each is slower than reading 10 files of 1 GB each, due to the overhead of opening connections and reading headers.
   - *Low split count, slow execution:* indicates Data Skew. One specific file or partition might be significantly larger than the others, causing one worker node to struggle while the rest sit idle.
3. **Shuffle (Exchange) Weight:** In the GROUP BY phase, data moves between nodes (shuffling) so that all records with the same `url` end up on the same worker. If the execution plan shows a massive amount of data moving across the Exchange node, consider filtering the data more aggressively before the aggregation step or increasing the parallelism of the cluster.

## Vectorization and File Formats

Once partition pruning limits the scope of the query, the file format dictates the speed of the scan. Because we are using Parquet, the engine can use vectorized reads: instead of deserializing one object at a time (row-oriented), the engine loads a column of integers or strings into a block of memory.

Verify in your execution plan that vectorization is enabled. In Spark, look for `Batched: true` in the Scan step. In Trino, this is the default behavior for Parquet and ORC.

If your query selects only specific columns (e.g., `SELECT url`), the columnar format allows the engine to fetch only the `url` chunks from storage.
The `event_timestamp` and other unused columns are never read from storage. This is known as Column Projection.

The combination of Partition Pruning (skipping files based on the directory structure) and Column Projection (skipping columns based on the file layout) ensures that the query processes the minimum amount of data required to answer the question.
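As a rough model of the combined effect — assuming, purely for illustration, 120 equally sized monthly partitions and 20 equally wide columns (real Parquet column chunks vary widely in size, so treat this as an order-of-magnitude estimate) — the scan volume multiplies down like this:

```python
# Back-of-the-envelope cost model, not engine output:
# bytes read ~= total size x fraction of partitions kept x fraction of columns read.
# The 20-column figure is an assumption made for this sketch.
def estimated_scan_tb(total_tb, partitions_kept, partitions_total,
                      columns_read, columns_total):
    return total_tb * (partitions_kept / partitions_total) * (columns_read / columns_total)

# 500 TB table, one month kept out of 120, one column projected out of 20.
print(f"{estimated_scan_tb(500, 1, 120, 1, 20):.2f} TB")  # ~0.21 TB
```

Each optimization multiplies independently: pruning alone gets the 500 TB down to roughly 4.2 TB, and projection shrinks that by a further factor equal to the share of column data actually selected.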