Modern data lake architectures decouple storage from compute, allowing each layer to scale independently. While this flexibility lowers costs, it introduces latency. To counter this, distributed query engines like Trino and Apache Spark must process data with extreme efficiency once it is loaded from object storage. After pruning partitions and skipping files, the engine loads the remaining data into memory. At this stage, the bottleneck shifts from network I/O to CPU utilization.

Vectorized query execution is the architectural strategy used to maximize CPU throughput. Instead of processing data one row at a time, the engine processes batches of values organized by column. This approach aligns software execution with modern hardware capabilities, specifically CPU caching hierarchies and Single Instruction, Multiple Data (SIMD) instructions.

## The Volcano Model vs Vectorization

To understand the necessity of vectorization, we must first examine the traditional method of query processing, known as the Volcano Model or the Iterator Model. In this standard implementation, every operator in a query plan (such as Filter, Project, or Join) acts as an iterator. The engine repeatedly calls a function, often named `next()`, to fetch a single tuple (row) from the child operator, process it, and pass it to the parent operator.

While the Volcano Model is simple to implement and efficient for row-oriented transaction processing, it creates significant overhead for analytical workloads in a data lake:

- **Instruction Overhead:** For a query processing one billion rows, the engine triggers at least one billion virtual function calls per operator. The CPU spends substantial cycles managing the function call stack rather than executing the actual data logic.
- **Cache Misses:** In a row-based layout, data for different columns is interleaved in memory. If a query only requires the `price` and `tax` columns, the CPU must load the entire row into the cache line, polluting the L1/L2 cache with irrelevant data (like `customer_description` or `address`).

Vectorized execution inverts this approach. Instead of a `next()` call returning a single row, it returns a batch (or vector) of columnar data, typically 1024 or 4096 values at a time. The processing loop iterates over this tight array of primitives (integers, floats, or booleans).

```dot
digraph G {
  rankdir=LR;
  node [fontname="Sans-Serif", shape=box, style=filled, color="#dee2e6"];
  edge [color="#868e96"];

  subgraph cluster_volcano {
    label="Volcano Model (Row-at-a-Time)";
    style=dashed;
    color="#adb5bd";
    row_input [label="Input: 1 Row", fillcolor="#a5d8ff"];
    func_call [label="Function Call Overhead", fillcolor="#ffc9c9"];
    cpu_exec_1 [label="CPU: Process Row", fillcolor="#dee2e6"];
    row_input -> func_call -> cpu_exec_1;
  }

  subgraph cluster_vector {
    label="Vectorized Model (Batch-at-a-Time)";
    style=dashed;
    color="#adb5bd";
    vec_input [label="Input: 4096 Column Values", fillcolor="#a5d8ff"];
    cpu_exec_2 [label="CPU: Process Batch\n(Tight Loop)", fillcolor="#63e6be"];
    simd [label="SIMD Instructions", fillcolor="#b197fc"];
    vec_input -> cpu_exec_2 -> simd;
  }
}
```

*Comparison of execution models. The Volcano model incurs overhead per item, whereas the vectorized model amortizes overhead across a batch of column values.*
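To make the contrast concrete, here is a deliberately simplified sketch of the two control flows in plain Python. The class and function names are illustrative only, not the internals of Trino or Spark; production engines implement these loops in compiled code.

```python
# Volcano-style: one virtual call per row, per operator.
class Scan:
    def __init__(self, rows):
        self._rows = iter(rows)

    def next(self):
        # Return a single row (tuple), or None when the input is exhausted.
        return next(self._rows, None)


class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        # One call down the operator tree for every single row processed.
        row = self.child.next()
        while row is not None:
            if self.predicate(row):
                return row
            row = self.child.next()
        return None


# Vectorized-style: one call per batch of column values.
def filter_prices(price_batch, threshold):
    # A single tight loop over a contiguous array of primitives;
    # real engines replace this loop with SIMD compare instructions.
    return [p for p in price_batch if p > threshold]


# The Volcano plan pays per-row call overhead; the batch version amortizes it.
plan = Filter(Scan([(101,), (95,), (230,)]), lambda row: row[0] > 100)
while (row := plan.next()) is not None:
    print(row)                              # (101,) then (230,)

print(filter_prices([101, 95, 230], 100))   # [101, 230]
```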
## Exploiting SIMD Instructions

The primary hardware advantage of vectorized execution is its ability to utilize SIMD instructions. Modern CPUs (x86 with AVX2/AVX-512 or ARM with NEON) contain registers capable of holding multiple data points simultaneously.

Consider a simple aggregation query: `SELECT SUM(price) FROM sales`.

In a scalar (non-vectorized) approach, the CPU adds the first price to the accumulator, then the second, and so on. Each addition requires a separate instruction cycle:

$$ \text{Total} = P_1 + P_2 + P_3 + P_4 $$

With SIMD, the CPU loads four (or more) 32-bit integers into a single 128-bit register. A single CPU instruction adds these four values to an accumulator register in parallel:

$$ \text{Total}_{vec} = \begin{bmatrix} P_1 \\ P_2 \\ P_3 \\ P_4 \end{bmatrix} + \text{Accumulator} $$

This parallelism reduces the number of CPU cycles required by a factor proportional to the vector width. For data lakes storing billions of records, this results in order-of-magnitude improvements in scan and aggregation speeds.

## Columnar Format Integration

Vectorized execution works synergistically with columnar storage formats like Apache Parquet. Since Parquet already stores data physically in columns, the query engine can read data directly from disk into memory vectors with minimal transformation.

If the data were stored in a row-oriented format (like Avro or CSV), the engine would have to:

1. Read the row.
2. Parse the specific fields.
3. Pivot the data into columnar memory vectors.

With Parquet and a vectorized reader, the engine performs a direct memory mapping or a streamlined copy from the file buffer to the execution vector. Formats like Apache Arrow provide a standardized in-memory columnar format that allows different systems to exchange these vectors without serialization overhead.

## Code Example: Scalar vs Vectorized Logic

To visualize the implementation difference, consider how a filter operation works in code. We want to keep only the prices greater than 100.

Scalar implementation (Python-like pseudocode):

```python
# High overhead: type checking and function calls happen
# inside the loop for every single item.
results = []
for price in price_column:
    if price > 100:
        results.append(price)
```

Vectorized implementation:

```python
# Low overhead: the type check happens once, and the loop runs at C-level
# speed, often unrolled by the compiler. Modern engines use SIMD to compare
# chunks of 'price_array' against 100 simultaneously.
mask = price_array > 100
results = price_array[mask]
```

In the vectorized version, the interpreter overhead (or the query planner overhead) occurs only once per batch, rather than once per row.

## Memory Bandwidth and Caching

Vectorization also addresses the memory wall: the gap between CPU speed and memory retrieval speed. By iterating over contiguous blocks of memory (a column vector), the engine exhibits excellent spatial locality. When the CPU fetches the first value of a vector, the memory controller pulls a cache line (typically 64 bytes) containing the subsequent values into the L1 cache. Subsequent iterations of the loop consume data that is already present in the L1 cache, preventing expensive stalls while the CPU waits for data from main RAM.
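As a concrete illustration of these ideas working together, the sketch below reads a Parquet file in column batches and applies vectorized kernels to each batch. It assumes the `pyarrow` library is installed and that a local file named `sales.parquet` with a numeric `price` column exists; both are illustrative assumptions rather than a requirement of any particular engine.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")  # hypothetical input file
total = 0

# Read only the "price" column, 4096 rows per batch. Each batch arrives as a
# contiguous Arrow array: no row pivoting is needed, and every compute call
# below runs a tight vectorized kernel over that buffer.
for batch in pf.iter_batches(batch_size=4096, columns=["price"]):
    prices = batch.column(0)
    expensive = pc.filter(prices, pc.greater(prices, 100))  # vectorized predicate
    total += pc.sum(expensive).as_py() or 0                 # vectorized aggregation

print(f"Sum of prices over 100: {total}")
```

Because each batch is already columnar, the same buffers could be handed to another Arrow-aware system without serialization, which is exactly the exchange pattern described above.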
## Limitations and Fallbacks

While vectorization is the default for most modern data lake engines, certain scenarios force the engine to fall back to row-by-row processing:

- **Complex UDFs:** If a query involves a User Defined Function (UDF) written in a language that does not support vectorization (e.g., a standard Python UDF in Spark without Arrow optimization), the engine must deserialize the vector into rows to feed the function.
- **Row-Oriented Operations:** Operations that are inherently row-based, such as complicated regular expression matching on unstructured text or operations on deeply nested JSON structures, may not benefit significantly from vectorization.

Engineers analyzing query plans with `EXPLAIN` should look for indications of vectorized processing. In Spark SQL, for example, the physical plan might show `Batched: true` or reference `VectorizedParquetRecordReader`. If these indicators are missing on large table scans, it suggests that a configuration issue or an unsupported data type is preventing the engine from running at full speed.
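The snippet below is a minimal PySpark sketch of that check: it prints the relevant reader setting and the formatted physical plan for a Parquet scan. The path `s3://lake/sales/` is purely illustrative; substitute your own table or location.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The vectorized Parquet reader is enabled by default, but it can be turned
# off explicitly or bypassed when a column type is unsupported.
print(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))

df = spark.read.parquet("s3://lake/sales/")  # illustrative path

# Inspect the FileScan node of the physical plan and look for "Batched: true".
df.select("price").where("price > 100").explain(mode="formatted")
```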