Data warehouses designed for high-volume analytics operate on principles distinct from traditional transactional databases. When datasets exceed the capacity of a single server, performance depends on how effectively a system can distribute work. This chapter examines the mechanics of Massively Parallel Processing (MPP), where a shared-nothing architecture allows multiple compute nodes to process data segments simultaneously.
If a dataset of size $D$ is distributed evenly across $N$ nodes, the ideal processing load for a single node approaches:

$$\frac{D}{N}$$
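As a quick illustration of this scaling relationship, the sketch below computes the ideal per-node load for a hypothetical cluster. The row count and node count are invented for the example; real workloads rarely hit this ideal exactly because of data skew and coordination overhead.

```python
def ideal_per_node_load(total_rows: int, num_nodes: int) -> float:
    """Ideal share of work for one node in a shared-nothing cluster: D / N."""
    return total_rows / num_nodes

# With 1.2 billion rows spread evenly over 8 nodes, each node ideally
# scans 150 million rows -- an 8x reduction versus a single server.
print(ideal_per_node_load(1_200_000_000, 8))  # → 150000000.0
```

Note that this is an upper bound on efficiency: if one node receives a disproportionate share of the data, the whole query waits on that straggler.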
Achieving this theoretical efficiency in production requires a solid understanding of storage internals. We will analyze how modern platforms decouple compute from storage, allowing engineers to scale resources independently based on workload demands rather than storage capacity.
The content covers the physical organization of data, comparing row-oriented storage against columnar formats used in BigQuery, Redshift, and Snowflake. You will review how compression algorithms reduce I/O overhead and how metadata allows the query engine to ignore irrelevant data blocks through micro-partitioning. By the end of this module, you will be able to inspect storage profiles and evaluate how architectural choices directly impact query latency and cost.
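To make the metadata-driven block skipping concrete, here is a minimal sketch of the zone-map idea that underlies micro-partitioning: each block of a column stores its min and max values, and the engine consults that metadata to prune blocks that cannot contain a match before reading any data. The block structure and function names are illustrative, not any platform's actual API.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """One column segment plus its min/max metadata (a simple zone map)."""
    min_val: int
    max_val: int
    values: list

# Three illustrative blocks holding a sorted integer column.
blocks = [
    Block(1, 100, list(range(1, 101))),
    Block(101, 200, list(range(101, 201))),
    Block(201, 300, list(range(201, 301))),
]

def scan_equals(blocks: list[Block], target: int) -> tuple[list[int], int]:
    """Return matching values and the count of blocks actually read.

    The min/max check uses only metadata, so pruned blocks incur no I/O.
    """
    blocks_read = 0
    hits = []
    for b in blocks:
        if b.min_val <= target <= b.max_val:  # metadata check, no data access
            blocks_read += 1
            hits.extend(v for v in b.values if v == target)
    return hits, blocks_read

hits, blocks_read = scan_equals(blocks, 250)
print(hits, blocks_read)  # → [250] 1  (two of three blocks pruned)
```

The same principle is what makes clustering and sort order matter in columnar warehouses: the tighter each block's min/max range, the more blocks a selective predicate can skip.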
1.1 Massively Parallel Processing Fundamentals
1.2 Decoupling Compute and Storage
1.3 Columnar Storage Formats and Compression
1.4 Micro-partitioning and Metadata Management
1.5 Hands-on Practice: Inspecting Storage Profiles
© 2026 ApX Machine Learning