Data warehouses distinguish themselves from transactional databases primarily through their storage engine architecture. While operational systems optimize for writing individual records using row-oriented storage, analytical systems must process aggregations across billions of records with low latency. This requirement necessitates a shift to columnar storage formats.

In a row-oriented system (such as PostgreSQL or MySQL), data is stored physically adjacent on disk by row. To retrieve a single attribute, such as the `total_amount` from a sales table, the storage engine must read the entire row block into memory, discarding the irrelevant fields.

In a columnar format, the values for a single column are stored contiguously. This physical organization dramatically reduces the input/output (I/O) bandwidth required for analytical queries. If a table contains 100 columns and a query selects only 5, a columnar engine fetches only the data blocks corresponding to those 5 columns.

Consider a table with $N$ rows and $k$ columns, and let $s(c_i)$ be the average size in bytes of column $i$. The total size of the table is approximately:

$$ S_{total} = N \sum_{i=1}^{k} s(c_i) $$

In a full table scan where a query requests a subset of columns $Q \subset \{c_1, \ldots, c_k\}$, the I/O cost for a row store is proportional to $S_{total}$. For a column store, the cost is proportional only to the requested data:

$$ Cost_{IO} \approx N \sum_{c_j \in Q} s(c_j) $$

Since analytical queries typically access less than 15% of the available columns, the reduction in I/O is often an order of magnitude.

## Storage Layout Visualization

The following diagram contrasts how data blocks are populated in row-oriented versus columnar storage. Note how the columnar layout groups homogeneous data types, which is a critical factor for the compression techniques discussed later.

```dot
digraph G {
    rankdir=TB;
    node [shape=record, style=filled, fontname="Arial", fontsize=10];
    edge [color="#868e96"];

    subgraph cluster_logical {
        label="Logical Table Data";
        style=dashed; color="#adb5bd"; fontcolor="#495057";
        data_record [label="{ID|1|2|3}|{Region|US|EU|US}|{Sales|100|200|150}", fillcolor="#e9ecef", color="#dee2e6"];
    }

    subgraph cluster_row {
        label="Row-Oriented Storage (Page Layout)";
        style=solid; color="#adb5bd"; fontcolor="#495057";
        row_page1 [label="<h> Page 1 | {1 | US | 100} | {2 | EU | 200}", fillcolor="#a5d8ff", color="#74c0fc"];
        row_page2 [label="<h> Page 2 | {3 | US | 150} | {...}", fillcolor="#a5d8ff", color="#74c0fc"];
    }

    subgraph cluster_col {
        label="Columnar Storage (Page Layout)";
        style=solid; color="#adb5bd"; fontcolor="#495057";
        col_page1 [label="<h> Page 1 (ID) | 1 | 2 | 3", fillcolor="#b2f2bb", color="#8ce99a"];
        col_page2 [label="<h> Page 2 (Region) | US | EU | US", fillcolor="#ffc9c9", color="#ffa8a8"];
        col_page3 [label="<h> Page 3 (Sales) | 100 | 200 | 150", fillcolor="#d0bfff", color="#b197fc"];
    }

    data_record -> row_page1 [style=invis];
    row_page1 -> col_page1 [style=invis];
}
```

*Comparison of physical block placement. Row stores interleave mixed data types within a block; column stores segregate data types into distinct blocks.*

## Compression Algorithms

The second advantage of columnar storage is compressibility. Because a single physical block contains values from only one column, the data is homogeneous: integers are stored next to integers and strings next to strings. This uniformity allows compression algorithms to achieve significantly higher ratios (often 3:1 to 10:1) than they do on row-oriented blocks. Modern warehouses like Snowflake and Redshift apply different encoding algorithms automatically based on the data characteristics.

### Run-Length Encoding (RLE)

RLE is highly effective for columns with low cardinality (few unique values) that are sorted or clustered. Instead of storing repeated values, the engine stores each value together with the count of its consecutive occurrences. Consider a status column sorted by state:

```
Raw: ['ACTIVE', 'ACTIVE', 'ACTIVE', 'INACTIVE', 'INACTIVE']
RLE: [('ACTIVE', 3), ('INACTIVE', 2)]
```

This reduces the storage footprint and accelerates processing, because the query execution engine can operate directly on the compressed data. For instance, a COUNT(*) operation simply sums the run lengths without decompressing the values (a sketch of this follows the dictionary encoding section below).

### Dictionary Encoding

For columns with medium cardinality, such as `Customer_City` or `Product_Category`, dictionary encoding replaces long strings with small integers. The engine maintains a dictionary mapping values to IDs.

```
Raw Data:     ['Laptop', 'Smartphone', 'Laptop', 'Tablet']
Dictionary:   {0: 'Laptop', 1: 'Smartphone', 2: 'Tablet'}
Encoded Data: [0, 1, 0, 2]
```

This allows the CPU to perform integer comparisons rather than string comparisons during filtering, which is computationally cheaper and more cache-friendly.
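The following is a minimal Python sketch of the RLE scheme described in the run-length encoding section above, assuming a plain list of values; `rle_encode` and `rle_count` are illustrative helpers, not any specific warehouse's internals. Note how the COUNT(*) analogue never touches the raw values:

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive duplicates into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_count(encoded):
    """COUNT(*) over RLE data: sum the run lengths, no decompression needed."""
    return sum(length for _, length in encoded)

statuses = ['ACTIVE', 'ACTIVE', 'ACTIVE', 'INACTIVE', 'INACTIVE']
encoded = rle_encode(statuses)
print(encoded)             # [('ACTIVE', 3), ('INACTIVE', 2)]
print(rle_count(encoded))  # 5
```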
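Dictionary encoding admits a similarly small sketch. The helper below is hypothetical (and builds a value-to-ID map, the inverse of the ID-to-value table shown above); it assigns IDs in order of first appearance and shows how a string filter becomes an integer comparison:

```python
def dict_encode(values):
    """Map each distinct string to a small integer ID, in order of first appearance."""
    dictionary, encoded = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        encoded.append(dictionary[v])
    return dictionary, encoded

products = ['Laptop', 'Smartphone', 'Laptop', 'Tablet']
dictionary, encoded = dict_encode(products)
print(dictionary)  # {'Laptop': 0, 'Smartphone': 1, 'Tablet': 2}
print(encoded)     # [0, 1, 0, 2]

# A filter like WHERE product = 'Tablet' becomes an integer comparison:
target = dictionary['Tablet']
matches = [i for i, code in enumerate(encoded) if code == target]
print(matches)     # [3]
```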
### Delta Encoding

Delta encoding is used for numerical data or timestamps that increase monotonically or vary only slightly between neighbors. Instead of storing full integers (which might require 4 or 8 bytes each), the engine stores the difference between adjacent values.

```
Raw:                      [1638400100, 1638400105, 1638400110]
Delta (Base: 1638400100): [0, 5, 5]
```

The deltas often fit into fewer bits than the original integers, allowing for "bit-packing", where multiple values are packed into a single standard integer word (a short sketch follows the SIMD discussion below).

## Vectorized Processing and SIMD

Columnar formats enable vectorized query execution. Because data is arranged in arrays of a single type, modern CPUs can use SIMD (Single Instruction, Multiple Data) instructions. In a standard row-based iteration, the CPU processes one value at a time:

```python
# Scalar processing logic (pseudo-code): the engine evaluates the
# predicate one row at a time.
for row in rows:
    if row.quantity > 10:
        result.append(row)
```

In vectorized processing, the CPU loads a batch of column values into a vector register (e.g., AVX-512) and applies the comparison to all values simultaneously with a single instruction:

$$ V_{result} = V_{quantity} > [10, 10, \ldots, 10] $$

This hardware-level optimization reduces CPU cycles per row and improves L1/L2 cache locality, as the CPU cache is filled with relevant column data rather than irrelevant fields from a row.
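To make the vectorized model concrete, here is a Python analogue using NumPy. NumPy is a stand-in here: the array comparison is dispatched to compiled loops that can use SIMD instructions, whereas a real engine would apply the predicate to its own column vectors.

```python
import numpy as np

# The comparison is applied across the whole array at once; NumPy runs it
# in compiled, SIMD-capable loops rather than a Python-level for loop.
quantity = np.array([3, 12, 7, 25, 10, 42], dtype=np.int32)
mask = quantity > 10     # vectorized predicate: [False, True, False, True, False, True]
result = quantity[mask]  # selection via the boolean mask
print(result)            # [12 25 42]
```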
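Returning to delta encoding, the sketch below (an illustrative `delta_encode` helper, not a library API) reproduces the timestamp example and uses Python's `int.bit_length()` to show why the deltas bit-pack so well:

```python
def delta_encode(values):
    """Store the first value as a base, then differences between neighbors."""
    base = values[0]
    deltas = [0] + [b - a for a, b in zip(values, values[1:])]
    return base, deltas

timestamps = [1638400100, 1638400105, 1638400110]
base, deltas = delta_encode(timestamps)
print(base, deltas)  # 1638400100 [0, 5, 5]

# Bit-packing intuition: the raw values need 31 bits each, the deltas only 3.
print(max(t.bit_length() for t in timestamps))  # 31
print(max(d.bit_length() for d in deltas))      # 3
```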
## Impact on Query Performance

The combination of reduced I/O and high compression creates a non-linear improvement in query performance. However, it comes with the trade-off of expensive write operations: modifying a single row in a columnar database typically requires rewriting multiple data blocks across the disk, which is why massively parallel processing (MPP) systems favor batch ingestion over single-row inserts.

The chart below illustrates the performance divergence between row and column stores as the number of queried columns changes.

```json
{
  "layout": {
    "title": "Query Latency: Row vs Columnar Store",
    "xaxis": { "title": "Number of Columns Selected (out of 100)" },
    "yaxis": { "title": "Query Latency (Seconds)" },
    "barmode": "group",
    "plot_bgcolor": "#f8f9fa",
    "paper_bgcolor": "#ffffff"
  },
  "data": [
    {
      "x": [1, 5, 10, 50, 100],
      "y": [0.2, 0.8, 1.5, 8.0, 15.0],
      "name": "Columnar Store",
      "type": "bar",
      "marker": { "color": "#339af0" }
    },
    {
      "x": [1, 5, 10, 50, 100],
      "y": [14.5, 14.8, 15.2, 15.8, 16.5],
      "name": "Row Store",
      "type": "bar",
      "marker": { "color": "#fa5252" }
    }
  ]
}
```

*Performance characteristics of storage engines. Columnar stores maintain low latency when fetching few columns but converge toward row-store performance as the query selects nearly all columns (reconstructing the full tuple).*

## Metadata and Block Headers

To manage these compressed blocks, MPP systems maintain extensive metadata headers. Each file or micro-partition contains a header section recording the minimum and maximum values of each column within that block.

When a query includes a filter predicate (e.g., `WHERE order_date > '2023-01-01'`), the query optimizer checks these headers before decompressing the block. If a block's maximum `order_date` is '2022-12-31', the engine skips the entire block. This mechanism, known as Block Pruning or Zone Maps, allows the system to bypass gigabytes of data entirely, relying on the columnar structure to organize data physically in a way that aligns with access patterns.
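As an illustration of block pruning, the sketch below models zone-map headers with a hypothetical `BlockHeader` structure; real systems store richer per-column statistics, but the pruning test has the same shape. ISO date strings compare lexicographically, so plain string comparison suffices here.

```python
from dataclasses import dataclass

@dataclass
class BlockHeader:
    """Illustrative zone-map entry: min/max of one column within a block."""
    block_id: int
    min_order_date: str
    max_order_date: str

headers = [
    BlockHeader(1, '2022-01-03', '2022-12-31'),
    BlockHeader(2, '2023-01-01', '2023-06-30'),
    BlockHeader(3, '2023-07-01', '2023-12-31'),
]

# WHERE order_date > '2023-01-01': consult the headers before touching data.
predicate_low = '2023-01-01'
to_scan = [h.block_id for h in headers if h.max_order_date > predicate_low]
print(to_scan)  # [2, 3] -- block 1 is pruned without ever being decompressed
```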