Data modeling in Massively Parallel Processing (MPP) systems requires shifting priorities from storage conservation to compute optimization and flexibility. Traditional approaches like Third Normal Form (3NF) often incur significant performance penalties due to extensive data shuffling across distributed nodes. Even the standard Star Schema, while effective for the presentation layer, can become rigid when handling frequent structural changes or high-velocity ingestion rates in a petabyte-scale environment.
This chapter examines architectural patterns designed to support scalability and iterative development. You will first evaluate the specific limitations of dimensional modeling within distributed storage environments. The content then moves to Data Vault 2.0, where you will learn to construct Hubs, Links, and Satellites to decouple business keys from their descriptive attributes. We also cover technical methods for ingesting and querying semi-structured data formats, such as JSON and Parquet, utilizing native SQL extensions rather than external ETL processes. The section concludes with techniques for managing schema evolution, allowing your warehouse to adapt to source system changes without breaking existing data contracts.
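The decoupling of business keys that Data Vault 2.0 achieves rests on deterministic hash keys: because every Hub, Link, and Satellite derives the same surrogate from the same business key, tables can be loaded in parallel without sequence lookups. The sketch below illustrates the idea, assuming an MD5-based hash key (a common Data Vault 2.0 convention); the function name `hub_hash_key` and the normalization rule (trim and uppercase) are illustrative choices, not a fixed standard.

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    # Normalize the business key first so that cosmetic differences
    # (whitespace, letter case) in source systems map to one Hub row.
    normalized = business_key.strip().upper()
    # A deterministic hash lets Hubs, Links, and Satellites compute
    # the same surrogate independently, enabling parallel loads.
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same customer key from two differently formatted sources
# resolves to the same Hub hash key.
print(hub_hash_key("cust-1001") == hub_hash_key("  Cust-1001 "))  # -> True
```

In a warehouse, the equivalent expression would typically be computed in SQL at load time (for example with the platform's built-in `MD5` function) rather than in application code.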
2.1 Dimensional Modeling Constraints in Big Data
2.2 Data Vault 2.0 Implementation Patterns
2.3 Handling Semi-Structured Data
2.4 Schema Evolution and Versioning
2.5 Hands-on Practice: Designing a Data Vault
© 2026 ApX Machine Learning