Understanding how a database executes your SQL query is the single most effective skill for diagnosing performance issues. While SQL is a declarative language where you define what you want, the database engine determines how to retrieve it. This plan of action is called the Query Execution Plan.
In analytical workloads, the difference between a good plan and a bad one is often the difference between a query returning in three seconds or three hours. We will compare how a row-oriented system and a column-oriented system process the same aggregation query. This comparison reinforces the architectural distinctions discussed earlier in the chapter.
Imagine a retail dataset containing a sales table with 10 million records. The table schema includes order_id (integer), customer_id (integer), product_id (integer), transaction_date (date), store_id (integer), and amount (decimal).
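For concreteness, here is a minimal DDL sketch of the table (the exact types, and the decimal precision in particular, are assumptions based on the description above):

CREATE TABLE sales (
    order_id         INTEGER,
    customer_id      INTEGER,
    product_id       INTEGER,
    transaction_date DATE,
    store_id         INTEGER,
    amount           DECIMAL(12, 2)  -- precision assumed for a monetary value
);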
We want to calculate the total revenue for the year 2023.
SELECT SUM(amount)
FROM sales
WHERE transaction_date >= '2023-01-01'
AND transaction_date <= '2023-12-31';
In a traditional operational database (like PostgreSQL or MySQL using standard storage engines), data is stored sequentially on disk by row. To execute the query above, the database engine typically performs a Sequential Scan (Table Scan) if no index matches the predicate perfectly or if the selectivity is low (meaning many rows match the condition).
The engine must load the page containing the full row into memory to inspect the transaction_date. Even though our query only requires transaction_date and amount, the system retrieves customer_id, product_id, and store_id as collateral overhead.
Consider the I/O implications. If the average row width is 100 bytes and we scan 10 million rows:
$$ \text{Data}_{\text{read}} = 10{,}000{,}000 \times 100 \text{ bytes} \approx 1 \text{ GB} $$
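You can sanity-check a back-of-the-envelope estimate like this against the actual size on disk. In PostgreSQL, for example:

SELECT pg_size_pretty(pg_relation_size('sales')) AS heap_size;
-- Returns the on-disk size of the table's main data files, which
-- approximates the volume a full sequential scan must read.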
The following diagram illustrates the access path in a row-oriented storage engine. Notice how the engine must traverse the entire width of the row structure.
Data flow in a row-store scan requires loading irrelevant columns into memory before filtering.
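You can produce such a plan yourself by prefixing the query with EXPLAIN, or with EXPLAIN ANALYZE to also execute it and report actual row counts (standard PostgreSQL syntax; the numbers will vary with your statistics and cost settings):

EXPLAIN ANALYZE
SELECT SUM(amount)
FROM sales
WHERE transaction_date >= '2023-01-01'
  AND transaction_date <= '2023-12-31';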
In a typical EXPLAIN output for a row store, you might see cost estimates that reflect this heavy I/O load:
Seq Scan on sales (cost=0.00..154321.00 rows=500000 width=6)
Filter: ((transaction_date >= '2023-01-01') AND (transaction_date <= '2023-12-31'))
The cost metric here is usually a composite unit representing disk page fetches and CPU cycles. In a row store, this cost scales linearly with the total number of columns and rows in the table, regardless of how many columns you actually requested.
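In PostgreSQL specifically, the sequential scan estimate follows a simple linear model (simplified here; the per-row cost of evaluating the filter is folded into the CPU terms):

$$ \text{cost} \approx \text{pages} \times \texttt{seq\_page\_cost} + \text{rows} \times (\texttt{cpu\_tuple\_cost} + \texttt{cpu\_operator\_cost}) $$

Because pages is proportional to row width times row count, trimming the SELECT list does nothing to reduce it: every page still holds full rows.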
Analytical warehouses (like Snowflake, BigQuery, or Redshift) store data by column. When the same query executes, the engine identifies that only the transaction_date and amount columns are required. It ignores the files or blocks containing customer_id, product_id, and store_id.
Furthermore, column stores apply compression algorithms suited to each data type. A date column in a sales table has few distinct values relative to the row count and a predictable format, so it compresses efficiently with techniques such as run-length or delta encoding. Integer and decimal columns also compress well.
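Some engines let you inspect or estimate these encodings directly. In Amazon Redshift, for example, the following command samples the table and recommends a compression encoding per column (a sketch; the report format varies by version):

ANALYZE COMPRESSION sales;
-- Suggests an encoding (e.g., RUNLENGTH, DELTA, ZSTD) and an
-- estimated space reduction for each column.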
If the column data is compressed by a factor of 10:1, and we only read 2 columns (assuming 8 bytes each):
$$ \text{Data}_{\text{read}} = 10{,}000{,}000 \times (8 + 8) \text{ bytes} = 160 \text{ MB} $$
With compression, this might drop to 16-20 MB of actual disk I/O. This is orders of magnitude less data transfer than the row-oriented approach.
Additionally, analytical engines use Zone Maps or Min/Max pruning. The metadata for a block of data often stores the minimum and maximum values for columns within that block. If a block's metadata says the transaction_date range is 2021-01-01 to 2021-06-01, the engine skips that block entirely because it cannot contain values from 2023.
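Pruning effectiveness is observable after the fact. In Snowflake, for instance, the query history exposes per-query partition statistics (a sketch; this view lives in the privileged ACCOUNT_USAGE schema):

SELECT query_text,
       partitions_scanned,
       partitions_total,
       bytes_scanned
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%sales%'
ORDER BY start_time DESC
LIMIT 10;
-- A partitions_scanned value far below partitions_total means
-- min/max pruning eliminated most blocks.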
Columnar execution skips irrelevant data blocks (pruning) and reads only necessary vertical slices.
The following chart highlights the dramatic difference in data scanning requirements between the two architectures for this specific aggregation query.
Volume of data read from disk for a simple aggregation query on a 10M row dataset.
When working with analytical databases, you should look for specific indicators in the query plan that confirm the system is utilizing columnar optimizations.
- If Partitions Scanned equals Total Partitions, the engine performed a full scan and failed to use metadata pruning.
- In a row store, adding columns to the SELECT clause rarely changes the I/O cost significantly (since the whole row is read anyway). In a column store, adding columns linearly increases I/O.

The physical design of the schema directly influences these query plans. In the upcoming chapters, when we discuss Partitioning and Clustering, we are essentially organizing data on disk to maximize the effectiveness of the prune-and-scan mechanism demonstrated here.
By reading the query plan, you verify whether your data model successfully uses the underlying storage engine. If you design a Star Schema but the query plan shows full table scans on dimension joins without pruning, the model requires physical tuning.