Symmetric Multi-Processing (SMP) architectures, where multiple processors share a single operating system and memory bus, inevitably reach a point of diminishing returns. As you add CPUs to a monolithic server, bus contention for memory access increases, eventually stalling throughput regardless of compute power. To process petabyte-scale datasets within acceptable latency windows, we must abandon shared resources in favor of Massively Parallel Processing (MPP).

MPP systems operate on a "shared-nothing" architecture. In this model, the database is partitioned across multiple nodes, each possessing its own independent CPU, memory, and locally attached storage (or a dedicated path to remote object storage). A central leader node orchestrates the cluster, but the heavy lifting of filtering, joining, and aggregating data occurs on the worker nodes.

The architectural advantage of MPP lies in its ability to scale horizontally. Doubling the number of nodes theoretically doubles the processing power and memory capacity, provided the workload is evenly distributed.

## The Shared-Nothing Topology

In a shared-nothing cluster, the system physically segments the data. When a query arrives, the leader node parses the SQL, creates an execution plan, and pushes fragments of that plan to the worker nodes. The workers execute these fragments against their local data slice and return only the results (or intermediate data) to the leader or other workers.

This reduces the movement of raw data across the network. By pushing the computation to the data rather than pulling data to the computation, MPP systems minimize I/O overhead, which is typically the primary bottleneck in analytical processing.

*Figure: Architecture of a standard shared-nothing MPP system illustrating the separation of coordination and distributed execution. A leader/coordinator node (query planning and parsing) connects over a high-speed interconnect (network fabric) to compute nodes that each own local RAM, CPU, and a dedicated storage slice.*

## Deterministic Data Distribution

For the leader node to assign work efficiently, it must know exactly where specific rows reside without scanning every node. This is achieved through deterministic distribution, typically using a hashing algorithm.

When you define a distribution key (or cluster key) on a table, the system applies a hash function to that column's value. The result determines the destination node (or micro-partition):

$$ H(k) \mod n = \text{Node ID} $$

where $H$ is the hash function, $k$ is the distribution key value, and $n$ is the number of nodes.
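As a minimal sketch of this idea, the Python snippet below assigns rows to nodes with a stable hash. The tables, column, and cluster size are hypothetical, and MD5 merely stands in for whatever engine-specific hash function a real MPP system uses.

```python
import hashlib

NUM_NODES = 4  # size of a hypothetical 4-node cluster

def node_for(key) -> int:
    """Deterministically map a distribution-key value to a node ID.

    MD5 is used only because it is stable across processes (Python's
    built-in hash() is salted per run); real engines use their own hash
    functions, but the principle H(k) mod n is the same.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

# Hypothetical rows from two tables that share the distribution key customer_id.
orders = [{"customer_id": 101, "amount": 40}, {"customer_id": 207, "amount": 15}]
customers = [{"customer_id": 101, "region": "EU"}, {"customer_id": 207, "region": "US"}]

for row in orders + customers:
    print("customer", row["customer_id"], "-> node", node_for(row["customer_id"]))

# Because both tables push the same key through the same hash function,
# customer 101's order row and customer row land on the same node, which
# is what makes the co-located joins discussed below possible.
```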
If a user joins two tables on the same key, and both tables are distributed by that key, the join can occur locally on each node without network traffic. This is often called a co-located join. If the tables are distributed differently, the system must perform a "shuffle," redistributing rows across the network to align the keys before the join can proceed.

## The Skew Problem

The theoretical linear scaling of MPP relies on the assumption that data, and therefore work, is distributed evenly. In production environments, data is rarely uniform. Certain values (e.g., NULL or a specific "Unknown" category) may appear disproportionately often.

If Node 1 receives 10 million rows while Nodes 2 through 10 receive only 1 million rows each, Node 1 becomes a straggler. In an MPP system, a query is only as fast as the slowest worker node.

The total execution time $T_{\text{job}}$ is defined by the maximum time taken by any single node:

$$ T_{\text{job}} = \max(T_{\text{node}_1}, T_{\text{node}_2}, \dots, T_{\text{node}_n}) + T_{\text{coordination}} $$

Skew creates a "hotspot" on the overloaded node and wastes the resources of the idle nodes waiting for the straggler to finish. Detecting skew requires analyzing the storage profile of your tables, a technique we will practice later in this module.

## Amdahl's Law and Network Overhead

While adding nodes increases raw compute power, it also introduces communication overhead. The interconnect layer, the network fabric connecting the nodes, has finite bandwidth. As the cluster size grows, so does the complexity of coordinating transactions and shuffling data.

This limitation is described by Amdahl's Law, which states that the potential speedup of a task is limited by the portion of the task that cannot be parallelized (the serial portion). If $s$ is the serial fraction of the work, the maximum speedup on $n$ nodes is:

$$ S(n) = \frac{1}{s + \frac{1 - s}{n}} $$

In data warehousing, the serial portion includes query parsing, final result aggregation at the leader node, and network latency.

*Figure: Theoretical vs. Actual MPP Scaling. Divergence between ideal linear scaling and actual performance due to network overhead and serial processing constraints (x-axis: number of nodes; y-axis: speedup factor).*
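To see how the two formulas above interact, here is a small Python sketch. The row counts, per-row processing cost, coordination overhead, and serial fraction are hypothetical values chosen for illustration; they are not measurements, and the Amdahl curve will not match the figure exactly because real scaling also absorbs network overhead.

```python
# Illustrative numbers only: row counts, per-row cost, and the serial
# fraction below are hypothetical, not measurements of a real cluster.

ROWS_PER_NODE = [10_000_000] + [1_000_000] * 9   # Node 1 is the straggler
COST_PER_ROW_S = 1e-6                            # assumed seconds per row
T_COORDINATION = 0.5                             # assumed fixed leader overhead (s)

def job_time(rows_per_node, cost_per_row, t_coord):
    """T_job = max(T_node_1 .. T_node_n) + T_coordination."""
    node_times = [rows * cost_per_row for rows in rows_per_node]
    return max(node_times) + t_coord

skewed = job_time(ROWS_PER_NODE, COST_PER_ROW_S, T_COORDINATION)
balanced = job_time([sum(ROWS_PER_NODE) // 10] * 10, COST_PER_ROW_S, T_COORDINATION)
print(f"skewed: {skewed:.2f} s   balanced: {balanced:.2f} s")

def amdahl_speedup(n_nodes, serial_fraction):
    """Amdahl's Law: speedup is capped by the serial portion of the work."""
    return 1 / (serial_fraction + (1 - serial_fraction) / n_nodes)

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:>3} nodes -> ideal {n:>2}x, Amdahl {amdahl_speedup(n, 0.05):.1f}x")
```

With these assumed numbers, the skewed layout takes 10.5 s versus 2.4 s for the balanced one even though both process the same 19 million rows: the straggler effect in miniature.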
## Elasticity and Cloud-Native MPP

Traditional on-premise MPP appliances (like early Netezza or Teradata boxes) coupled storage and compute tightly. If you ran out of disk space, you had to buy more nodes, paying for compute power you didn't need.

Modern cloud platforms (Snowflake, BigQuery, Redshift RA3) have evolved the MPP architecture by separating compute from storage. In these systems:

- **Storage** is offloaded to remote object stores (S3, GCS, Azure Blob). This layer scales virtually without limit and at low cost.
- **Compute** consists of stateless clusters (virtual warehouses) that can be spun up or down in seconds.

This separation introduces a new architectural consideration: Data Locality. Since compute nodes do not permanently store the data, they must fetch it from the remote object store over the network. To mitigate the latency penalty of these remote fetches, modern engines rely heavily on aggressive local SSD caching and columnar storage formats, which we will examine in the next section.
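As a preview of that caching idea, the sketch below shows the cache-aside pattern in miniature: a read is served from a local cache when possible and only falls back to the remote object store on a miss. The file name and latency figures are hypothetical placeholders, not vendor numbers.

```python
import time

# Hypothetical latencies for illustration only (not vendor benchmarks).
REMOTE_FETCH_S = 0.120   # assumed object-store round trip for one file
LOCAL_READ_S = 0.002     # assumed local SSD read for the same file

local_cache: dict[str, bytes] = {}   # stands in for the node's local SSD cache

def fetch_remote(file_key: str) -> bytes:
    """Placeholder for a GET against S3/GCS/Azure Blob."""
    time.sleep(REMOTE_FETCH_S)
    return b"columnar data for " + file_key.encode()

def read_file(file_key: str) -> bytes:
    """Cache-aside read: serve from the local cache when possible,
    otherwise pay the remote latency once and keep a local copy."""
    if file_key in local_cache:
        time.sleep(LOCAL_READ_S)
        return local_cache[file_key]
    data = fetch_remote(file_key)
    local_cache[file_key] = data
    return data

for attempt in range(2):
    start = time.perf_counter()
    read_file("orders/part-0001.parquet")
    print(f"read {attempt + 1}: {time.perf_counter() - start:.3f} s")
# The second read is served from the local cache, which is why repeated
# queries over the same data run much faster than the first "cold" scan.
```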