Standard dimensional modeling places heavy emphasis on the end-user experience, prioritizing simplified query logic through denormalization. However, in a distributed MPP environment processing petabytes of data, the maintenance overhead of Type 2 Slowly Changing Dimensions (SCD) can become a bottleneck. Updating millions of rows to close out prior record versions triggers massive data shuffling across compute nodes. Data Vault 2.0 addresses this by decoupling business keys, relationships, and descriptive attributes into distinct structures. This separation enables 100% parallel loading, eliminates ingestion dependencies, and supports agile schema evolution without refactoring existing tables.

## The Component Architecture

Data Vault 2.0 relies on three primary structures: Hubs, Links, and Satellites. Unlike the star schema, which centers on a fact table, the Data Vault centers on the business key.

### Hubs

A Hub contains only the unique list of business keys for a core business entity (e.g., Customer, Product, Invoice). It does not contain descriptive data or relationships. The primary goal of the Hub is to establish a distinct list of identifiers that other structures can reference.

In traditional warehousing, surrogate keys are often sequential integers generated by the database. In MPP systems, generating sequential IDs prevents parallel loading because every loader must wait on a centralized counter. Data Vault 2.0 replaces sequential IDs with Hash Keys.

The Hash Key is a deterministic hash (typically MD5 or SHA-256) of the normalized business key:

$$ K_{hash} = f_{hash}(\text{TRIM}(\text{UPPER}(K_{business}))) $$

This allows multiple compute nodes to calculate keys independently without coordination, enabling linear scalability during ingestion.
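As a minimal sketch of this pattern (assuming a Snowflake-style SQL dialect and a hypothetical staging table `stg_customer` with business key column `customer_bk`), the Hash Key can be derived entirely within the staging query:

```sql
-- Minimal sketch: derive the Hub Hash Key during staging (Snowflake-style SQL).
-- stg_customer, customer_bk, and the 'CRM' source label are illustrative assumptions.
SELECT
    MD5(UPPER(TRIM(customer_bk))) AS hub_customer_hk,  -- deterministic Hash Key
    customer_bk,                                       -- original business key
    CURRENT_TIMESTAMP()           AS load_date,
    'CRM'                         AS record_source
FROM stg_customer;
```

Because the same business key always yields the same hash, every node (and every independent pipeline run) arrives at an identical key without consulting a shared sequence generator.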
### Links

Links represent the associations between Hubs. They physically resolve every relationship as many-to-many, regardless of the logical cardinality. A Link table contains its own Hash Key, the Hash Keys of the Hubs it connects, and the Load Date. By externalizing relationships into Links, you can add new relationships without altering the Hubs or existing Satellites.

### Satellites

Satellites store the descriptive attributes (context) and their history. A Satellite is attached to either a Hub or a Link. It contains the parent's Hash Key, the Load Date, a Hash Diff (for change detection), and the actual data columns.

When a source record changes, you do not update the existing row. Instead, you insert a new row into the Satellite with a new Load Date. This "insert-only" architecture is highly efficient on the log-structured merge trees and columnar storage engines common to Snowflake and BigQuery.

*Figure: Structure of a basic Data Vault model showing the relationship between Hubs (blue), Links (red), and Satellites (green).*

## Parallel Ingestion and Hash Diffs

The scalability of Data Vault comes from its loading pattern. Because Hash Keys are deterministic, you can load Hubs, Links, and Satellites in parallel streams. The specific challenge in high-volume environments is Change Data Capture (CDC).

To avoid comparing every descriptive column of an incoming record against its previous version (an expensive multi-column comparison), Data Vault uses a Hash Diff: all descriptive columns in the source feed are concatenated and hashed, reducing change detection to a single-column equality test.

During the staging process, you calculate the Hash Diff for incoming records. The loading logic then becomes a simple set comparison, as sketched below:

1. Calculate `HashDiff_Incoming`.
2. Left join the staging table to the current Satellite rows on `Hub_Hash_Key`.
3. Filter where `Satellite.Hash_Key` is NULL (new parent) OR `Staging.HashDiff != Satellite.HashDiff` (changed data).
4. Insert the resulting records.

This comparison exploits the MPP engine's ability to perform massive hash joins efficiently.
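A minimal sketch of this load, again in Snowflake-style SQL with hypothetical names (`stg_customer` staged with a precomputed `hash_diff`, targeting a Satellite `sat_customer_details`):

```sql
-- Minimal sketch of the insert-only Satellite load (Snowflake-style SQL).
-- Assumes stg_customer.hash_diff was precomputed in staging, e.g. as
-- MD5(COALESCE(name, '') || '||' || COALESCE(address, '')).
INSERT INTO sat_customer_details (hub_customer_hk, load_date, hash_diff, name, address)
SELECT
    stg.hub_customer_hk,
    stg.load_date,
    stg.hash_diff,
    stg.name,
    stg.address
FROM stg_customer AS stg
LEFT JOIN (
    -- Current (most recent) Satellite row per parent Hash Key
    SELECT hub_customer_hk, hash_diff
    FROM sat_customer_details
    QUALIFY ROW_NUMBER() OVER (PARTITION BY hub_customer_hk ORDER BY load_date DESC) = 1
) AS cur
    ON stg.hub_customer_hk = cur.hub_customer_hk
WHERE cur.hub_customer_hk IS NULL        -- new parent key
   OR stg.hash_diff <> cur.hash_diff;    -- descriptive attributes changed
```

The join touches only two narrow hash columns, so the engine can distribute it as a standard hash join rather than a wide multi-column comparison.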
## Satellite Splitting for Performance

A common anti-pattern in early Data Vault implementations is grouping all attributes of a business entity into a single Satellite. In petabyte-scale warehouses, this leads to excessive I/O during scans.

Attributes often change at different velocities. For example, a customer's `LoyaltyTier` might change once a year, while their `LastLoginDate` changes daily. If both live in the same Satellite, every login event duplicates the static `LoyaltyTier` value in a new row, bloating storage and slowing down historical queries.

Split Satellites based on:

- **Rate of Change:** Group high-frequency columns together and low-frequency columns together.
- **Source System:** Isolate data by source (e.g., `Sat_Cust_Salesforce`, `Sat_Cust_SAP`). This prevents a delay in one source system from blocking the loading of another.
- **Data Classification:** Separate PII (Personally Identifiable Information) into a secured Satellite with restricted access policies (Row-Level Security).

## Query Performance and Point-in-Time (PIT) Tables

While Data Vault excels at writing, the high degree of normalization requires many joins to reconstruct a "flat" view of the data. Joining a Hub to five different Satellites and two Links can degrade query latency.

To resolve this for the consumption layer, we implement Point-in-Time (PIT) tables. A PIT table is a bridge table that pre-calculates the valid joins for specific time snapshots. A PIT table structure typically includes:

- The Hub Hash Key.
- The Snapshot Date.
- Foreign keys pointing to the valid record in each surrounding Satellite for that date.

By computing these temporal alignments during the batch window (or continuously), downstream queries become simple equi-joins on primary keys, enabling the MPP optimizer to use partition pruning effectively.

*Figure: Query execution time (seconds) versus number of joined Satellites, comparing direct Raw Vault joins against a pre-computed Point-in-Time (PIT) structure.*

## Handling Schema Evolution

Schema drift is inevitable. In a star schema, adding a column usually requires an `ALTER TABLE` command on a potentially massive fact table, which can lock resources. In Data Vault, schema evolution is handled by adding new Satellites.

If a source system adds five new fields describing a customer:

- **Option A:** Add the columns to the existing Satellite. This requires an `ALTER TABLE`, but is generally safe in Snowflake/BigQuery, where it is a metadata-only operation.
- **Option B:** Create a new vertical Satellite (e.g., `Sat_Customer_v2`) attached to the same Hub, as sketched below.

Option B is preferred when the new data arrives from a distinct microservice or has a different loading cadence. The Hub remains untouched. The Link remains untouched. Downstream views are updated to include the new columns, but the physical ingestion pipelines for existing data remain uninterrupted. This additive nature makes Data Vault particularly effective for agile engineering teams managing continuous integration pipelines.
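For Option B, a minimal DDL sketch (Snowflake-style; the new attribute columns are illustrative assumptions) shows how the new Satellite simply reuses the existing Hub's Hash Key:

```sql
-- Minimal sketch: additive schema evolution via a new vertical Satellite.
-- The Hub, Links, and existing Satellites are untouched; preferred_channel
-- and marketing_opt_in stand in for the newly arrived source fields.
CREATE TABLE sat_customer_v2 (
    hub_customer_hk   VARCHAR(32)   NOT NULL,  -- same Hash Key as the Customer Hub
    load_date         TIMESTAMP_NTZ NOT NULL,
    hash_diff         VARCHAR(32)   NOT NULL,  -- change detection over the new fields
    record_source     VARCHAR(50),
    preferred_channel VARCHAR(20),
    marketing_opt_in  BOOLEAN,
    PRIMARY KEY (hub_customer_hk, load_date)
);
```

The new Satellite loads on its own cadence through the same insert-only pattern shown earlier, and consumption views join it to the Hub alongside the original Satellites.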