Modern data lake architectures decouple storage from compute, allowing each layer to scale independently. While this flexibility lowers costs, it introduces latency. To counter this, distributed query engines like Trino and Apache Spark must process data with extreme efficiency once it is loaded from object storage. After pruning partitions and skipping files, the engine loads the remaining data into memory. At this stage, the bottleneck shifts from network I/O to CPU utilization.
Vectorized query execution is the architectural strategy used to maximize CPU throughput. Instead of processing data one row at a time, the engine processes batches of values organized by column. This approach aligns software execution with modern hardware capabilities, specifically CPU caching hierarchies and Single Instruction, Multiple Data (SIMD) instructions.
To understand the necessity of vectorization, we must first examine the traditional method of query processing, known as the Volcano Model or the Iterator Model. In this standard implementation, every operator in a query plan (such as Filter, Project, or Join) acts as an iterator. The engine repeatedly calls a function, often named next(), to fetch a single tuple (row) from the child operator, process it, and pass it to the parent operator.
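As a rough illustration (not modeled on any particular engine), the following Python sketch shows a Volcano-style plan in which a Filter operator pulls one tuple at a time from a Scan operator via next():

class Scan:
    """Leaf operator: returns one tuple per next() call, None at end of input."""
    def __init__(self, rows):
        self._rows = iter(rows)

    def next(self):
        return next(self._rows, None)


class Filter:
    """Volcano-style operator: at least one function call per row, per operator."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row


# Each qualifying row travels through the plan one call at a time.
plan = Filter(Scan([(1, 120.0), (2, 80.0), (3, 300.0)]), lambda r: r[1] > 100)
row = plan.next()
while row is not None:
    print(row)
    row = plan.next()

Every tuple pays the cost of these nested function calls, which is negligible for a single transaction but adds up quickly when scanning billions of rows.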
While the Volcano Model is simple to implement and efficient for row-oriented transaction processing, it creates significant overhead for analytical workloads in a data lake.
For example, when a query needs only the price and tax columns, the CPU must still load the entire row into the cache line, polluting the L1/L2 cache with irrelevant data (like customer_description or address).

Vectorized execution inverts this approach. Instead of a next() call returning a single row, it returns a batch (or vector) of columnar data, typically 1024 or 4096 values at a time. The processing loop iterates over this tight array of primitives (integers, floats, or booleans).
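The sketch below contrasts that batched interface with the Volcano example above, using NumPy arrays as stand-in column vectors; the class and method names are illustrative, not any engine's actual API:

import numpy as np


class ColumnScan:
    """Leaf operator: returns a vector of up to batch_size values per call."""
    def __init__(self, values, batch_size=1024):
        self.values = np.asarray(values)
        self.batch_size = batch_size
        self.pos = 0

    def next_batch(self):
        if self.pos >= len(self.values):
            return None
        batch = self.values[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch


class VectorizedFilter:
    """One call filters an entire batch; the tight loop runs in compiled code."""
    def __init__(self, child, threshold):
        self.child = child
        self.threshold = threshold

    def next_batch(self):
        batch = self.child.next_batch()
        if batch is None:
            return None
        return batch[batch > self.threshold]


plan = VectorizedFilter(ColumnScan([120.0, 80.0, 300.0, 50.0]), threshold=100)
batch = plan.next_batch()
while batch is not None:
    print(batch)
    batch = plan.next_batch()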
Comparison of execution models. The Volcano model incurs overhead per item, whereas the Vectorized model amortizes overhead across a batch of column values.
The primary hardware advantage of vectorized execution is its ability to utilize SIMD instructions. Modern CPUs (x86 with AVX2/AVX-512 or ARM with NEON) contain registers capable of holding multiple data points simultaneously.
Consider a simple aggregation query: SELECT SUM(price) FROM sales.
In a scalar (non-vectorized) approach, the CPU adds the first price to the accumulator, then the second, and so on. Each addition requires a separate instruction cycle.
With SIMD, the CPU loads four (or more) 32-bit integers into a single 128-bit register. A single CPU instruction adds these four values to an accumulator register in parallel.
This parallelism reduces the number of CPU cycles required by a factor proportional to the vector width. For data lakes storing billions of records, this results in order-of-magnitude improvements in scan and aggregation speeds.
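As a loose analogy in Python, NumPy's sum runs in a compiled loop that can use SIMD instructions, although the exact instructions depend on the CPU and how the library was built:

import numpy as np

prices = np.random.randint(1, 1_000, size=1_000_000, dtype=np.int32)

# Scalar: one Python-level addition (and loop iteration) per value.
total_scalar = 0
for p in prices:
    total_scalar += int(p)

# Vectorized: a single call into a compiled loop that can add several
# values per instruction using SIMD registers.
total_vectorized = int(prices.sum(dtype=np.int64))

assert total_scalar == total_vectorized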
Vectorized execution works synergistically with columnar storage formats like Apache Parquet. Since Parquet already stores data physically in columns, the query engine can read data directly from the disk into memory vectors with minimal transformation.
If the data were stored in a row-oriented format (like Avro or CSV), the engine would have to read each record in full, deserialize every field, and then pivot the values it actually needs into columnar vectors, discarding the rest.
With Parquet and a vectorized reader, the engine performs a direct memory mapping or a streamlined copy from the file buffer to the execution vector. Formats like Apache Arrow provide a standardized in-memory columnar format that allows different systems to exchange these vectors without serialization overhead.
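A minimal PyArrow sketch of this pattern (requiring a reasonably recent PyArrow; sales.parquet is a placeholder path) streams a single column as fixed-size batches:

import pyarrow.parquet as pq

# 'sales.parquet' is a placeholder path for this illustration.
parquet_file = pq.ParquetFile("sales.parquet")

# Stream only the 'price' column as fixed-size columnar batches.
for batch in parquet_file.iter_batches(batch_size=4096, columns=["price"]):
    prices = batch.column(0)  # an Arrow column vector
    print(prices.to_numpy(zero_copy_only=False).sum())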
To visualize the implementation difference, consider how a filter operation works in code. We want to keep only the prices greater than 100.
Scalar Implementation (Python-like pseudocode):
# High overhead: type checking and function calls happen
# inside the loop for every single item.
price_column = [120, 80, 300, 50]   # example input
results = []
for price in price_column:
    if price > 100:
        results.append(price)
Vectorized Implementation:
import numpy as np

price_array = np.array([120, 80, 300, 50])  # example input

# Low overhead: the type check happens once per batch.
# The loop runs at C-level speed, often unrolled by the compiler.
# Modern engines use SIMD to compare chunks of 'price_array' against 100 simultaneously.
mask = price_array > 100
results = price_array[mask]
In the vectorized version, the interpreter overhead (or, inside a query engine, the per-row operator dispatch overhead) is paid once per batch rather than once per row.
Vectorization also addresses the memory wall, the gap between CPU speed and memory retrieval speed. By iterating over contiguous blocks of memory (a column vector), the engine exhibits excellent spatial locality.
When the CPU fetches the first value of a vector, the memory controller pulls a cache line (typically 64 bytes) containing the subsequent values into the L1 cache. Subsequent iterations of the loop consume data already present in the L1 cache, preventing expensive stalls waiting for data from the main RAM.
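A toy NumPy comparison can illustrate the locality effect (timings vary widely by hardware and say nothing about a real engine): summing a strided column taken from a row-major table versus summing the same values stored contiguously.

import time
import numpy as np

n_rows, n_cols = 2_000_000, 8
table = np.random.rand(n_rows, n_cols)            # row-major: each row is 64 bytes
column_copy = np.ascontiguousarray(table[:, 0])   # same values, stored contiguously

start = time.perf_counter()
table[:, 0].sum()        # strided: a fresh cache line is loaded for every value
strided = time.perf_counter() - start

start = time.perf_counter()
column_copy.sum()        # contiguous: each 64-byte cache line serves 8 values
contiguous = time.perf_counter() - start

print(f"strided: {strided:.4f}s   contiguous: {contiguous:.4f}s")

On most machines the contiguous version is noticeably faster because every cache line it loads is fully consumed before the next one is fetched.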
While vectorization is the default for most modern data lake engines, certain scenarios force the engine to fall back to row-by-row processing. Common examples include row-oriented user-defined functions, deeply nested or otherwise unsupported data types, and operators or expressions for which the engine has no vectorized implementation.
Engineers analyzing query plans using EXPLAIN commands should look for indications of vectorized processing. In Spark SQL, for example, the physical plan might show Batched: true or reference VectorizedParquetRecordReader. If these indicators are missing on large table scans, it suggests a configuration issue or an unsupported data type is preventing the engine from running at full speed.
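For example, a quick check in PySpark might look like the following (the /lake/sales path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# '/lake/sales' is a placeholder path for this illustration.
df = spark.read.parquet("/lake/sales")

# The physical plan's FileScan node should report 'Batched: true' when the
# vectorized Parquet reader is in use.
df.select("price").where("price > 100").explain()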