Writing efficient SQL queries for data lakes requires a mental shift from traditional database optimization. Relational databases handle indexing and physical retrieval transparently through their storage engines. For a decoupled data lake architecture, queries must be explicitly designed to align with the underlying file organization, minimizing network I/O and decoding costs.
This practical section focuses on diagnosing query performance issues using execution plans. We will examine a common scenario: a slow aggregation query running on a large dataset. You will learn how to generate an execution plan, identify full table scans, and rewrite the query to utilize partition pruning.
Imagine a dataset of web server access logs stored in an object store (S3 or Azure Blob Storage). The data is stored in Apache Parquet format and follows a standard partitioning scheme based on the date of the event.
The access_logs table is laid out as year / month / day directories. A data analyst submits the following query to calculate the number of hits per URL during October 2023:
SELECT
url,
COUNT(*) as hit_count
FROM access_logs
WHERE event_timestamp >= TIMESTAMP '2023-10-01 00:00:00'
AND event_timestamp < TIMESTAMP '2023-11-01 00:00:00'
GROUP BY url;
Despite filtering for only one month of data, this query takes over 45 minutes to execute. To understand why, we need to inspect how the query engine translates this SQL into physical actions.
Most distributed query engines, including Trino (formerly PrestoSQL) and Apache Spark SQL, provide an EXPLAIN command. This command returns the logical or physical plan that the engine intends to execute. It does not run the query but outlines the steps involved.
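As a small, runnable analog, SQLite exposes the same idea through EXPLAIN QUERY PLAN: it reports the intended access path without executing the query. SQLite is a single-node engine, so this is only a miniature stand-in for Trino or Spark, but the diagnostic habit is identical: ask for the plan and look for a full scan.

```python
import sqlite3

# SQLite's EXPLAIN QUERY PLAN reports the access path without running the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_logs (url TEXT, event_timestamp TEXT)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT url, COUNT(*) FROM access_logs "
    "WHERE event_timestamp >= '2023-10-01' GROUP BY url"
).fetchall()

# The last field of each row is a human-readable step; with no index
# available, the plan reports a full scan of access_logs.
for *_, detail in plan:
    print(detail)
```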
To see the plan, prepend the EXPLAIN keyword to the statement:
EXPLAIN SELECT url, COUNT(*) FROM access_logs ...
The output is typically a tree structure representing the flow of data from the bottom (storage) to the top (result). In the plan for the slow query above, execution flows from the bottom up, and the most expensive nodes are those where the engine scans data before filtering it.
The critical component is the TableScan at the bottom of the plan.
The engine lists every file in the access_logs table, reads the event_timestamp column, and checks whether each value falls within October. The performance issue lies in the mismatch between the schema and the query. The table is partitioned by year, month, and day, but the query filters on event_timestamp. While these columns are semantically related, the query engine does not automatically know that event_timestamp correlates perfectly with the directory structure. Consequently, it defaults to a full table scan, reading metadata and footers across the entire 500 TB dataset to find the relevant rows.
To optimize this, we must explicitly trigger partition pruning. This occurs when the query engine uses the directory structure (e.g., year=2023/month=10) to skip files entirely, rather than filtering rows after reading them.
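The mechanics can be sketched in a few lines of Python. The bucket name, partition list, and prune helper below are all hypothetical; a real engine performs this lookup against the metastore's partition index rather than by string matching, but the effect is the same: whole directories are eliminated before any file is opened.

```python
# Hypothetical partition directories as the metastore would list them:
# 2 years x 12 months x 2 sample days = 48 partitions.
partitions = [
    f"s3://datalake/access_logs/year={y}/month={m:02d}/day={d:02d}/"
    for y in (2022, 2023)
    for m in range(1, 13)
    for d in (1, 15)
]

def prune(parts, year, month):
    # Keep only directories whose partition values satisfy the filter;
    # files in every other directory are skipped without being opened.
    needle = f"year={year}/month={month:02d}/"
    return [p for p in parts if needle in p]

pruned = prune(partitions, 2023, 10)
print(f"{len(partitions)} partitions -> {len(pruned)} after pruning")
```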
We rewrite the query to include filters on the partition columns:
SELECT
url,
COUNT(*) as hit_count
FROM access_logs
WHERE year = '2023'
AND month = '10'
AND event_timestamp >= TIMESTAMP '2023-10-01 00:00:00'
GROUP BY url;
When we run EXPLAIN on this revised query, the TableScan node changes significantly. The engine looks at the metadata catalog, identifies that only directories satisfying year=2023 and month=10 are required, and ignores the rest.
If the dataset spans 10 years, filtering down to one month reduces the input data volume by approximately a factor of 120.
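The arithmetic behind that factor, applied to the 500 TB figure from the scan discussion above:

```python
dataset_tb = 500            # total dataset size from the scenario
months_spanned = 10 * 12    # a 10-year dataset holds 120 monthly partitions
pruned_tb = dataset_tb / months_spanned

print(f"Full scan: {dataset_tb} TB; pruned scan: ~{pruned_tb:.1f} TB")
```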
The difference between a scanned filter and a pruned filter is often the difference between a failed query and a sub-minute response. The following chart compares the data scanned and execution time for the two approaches.
Comparing the resource consumption of a full scan versus a pruned scan. Note the logarithmic scale on the Y-axis; the reduction spans several orders of magnitude.
Modern engines often provide an EXPLAIN ANALYZE (Trino) or offer a UI (Spark History Server) that shows actual runtime statistics. When reviewing these detailed plans, focus on three specific metrics:
Input Rows vs. Output Rows: Look at the Filter operator. If the input is 1 billion rows and the output is 1,000 rows, your filter is highly selective. However, if this filtering happens after the scan, you have already paid the I/O cost for 1 billion rows. The goal is to push this reduction down to the Scan level via partitions.
Splits or Tasks: A "Split" represents a unit of work, typically a single file or a block of a file assigned to a worker node.
Shuffle (Exchange) Weight: In the GROUP BY phase, data moves between nodes (shuffling) so that all records with the same url end up on the same worker. If the execution plan shows a massive amount of data moving across the Exchange node, consider filtering the data more aggressively before the aggregation step or increasing the parallelism of the cluster.
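To make the first metric concrete, here is a sketch using hypothetical operator statistics of the shape an EXPLAIN ANALYZE run reports; the numbers mirror the example above and are not from a real plan.

```python
# Hypothetical operator statistics from an EXPLAIN ANALYZE run.
stats = {
    "Scan":   {"input_rows": 1_000_000_000, "output_rows": 1_000_000_000},
    "Filter": {"input_rows": 1_000_000_000, "output_rows": 1_000},
}

# The Filter is extremely selective, but because the reduction happens
# after the Scan, the I/O cost was already paid for every input row.
selectivity = stats["Filter"]["output_rows"] / stats["Filter"]["input_rows"]
wasted_rows = stats["Scan"]["output_rows"] - stats["Filter"]["output_rows"]

print(f"filter selectivity: {selectivity:.0e}")
print(f"rows scanned but discarded: {wasted_rows:,}")
```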
Once partition pruning limits the scope of the query, the file format dictates the speed of the scan. Since we are using Parquet, the engine utilizes vectorized reads. Instead of deserializing one record at a time (row-oriented), the engine loads a batch of values from a single column into a contiguous block of memory and processes them together.
Verify in your execution plan that vectorization is enabled. In Spark, look for Batched: true in the Scan step. In Trino, this is the default behavior for Parquet and ORC.
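The intuition can be sketched in plain Python: row-at-a-time processing constructs and walks every full record, while a columnar batch keeps one column contiguous in memory. The array module here stands in for an engine's column vector; this is an illustration of the layout difference, not how an engine is actually implemented.

```python
import array

# Row-oriented layout: one object per record; aggregating one field
# still requires materializing and walking every full record.
rows = [{"url": f"/page{i % 3}", "bytes_sent": i * 10} for i in range(1000)]
row_total = sum(r["bytes_sent"] for r in rows)

# Column-oriented layout: the bytes_sent column is a contiguous block
# of integers, which can be aggregated as a single batch.
bytes_col = array.array("q", (i * 10 for i in range(1000)))
col_total = sum(bytes_col)

assert row_total == col_total == 4_995_000
```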
If your query only selects specific columns (e.g., SELECT url), the columnar format allows the engine to fetch only the url chunks from storage. The event_timestamp column and other unused columns are never read. This is known as Column Projection.
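A toy model of the mechanism, assuming a file whose footer records the byte range of each column chunk, much as Parquet footers index column chunks; the layout and helper below are simplified for illustration.

```python
import io

# Toy columnar file: two column chunks stored back to back, with a
# footer mapping each column name to its (offset, length) byte range.
url_chunk = b"url-column-bytes"
ts_chunk = b"timestamp-column-bytes"
data = url_chunk + ts_chunk
footer = {
    "url": (0, len(url_chunk)),
    "event_timestamp": (len(url_chunk), len(ts_chunk)),
}

def read_column(fileobj, footer, name):
    # Seek straight to the requested chunk; unrequested columns
    # are never transferred from storage.
    offset, length = footer[name]
    fileobj.seek(offset)
    return fileobj.read(length)

f = io.BytesIO(data)
print(read_column(f, footer, "url"))  # only the url chunk is read
```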
The combination of Partition Pruning (skipping rows based on directory) and Column Projection (skipping columns based on file structure) ensures that the query processes the absolute minimum amount of data required to answer the question.