A Data Vault model requires a fundamental shift in how relationships and data storage are perceived in a warehouse. Unlike the star schema, which pre-joins data for read optimization, Data Vault optimizes for write flexibility and auditability. A Raw Vault for an e-commerce platform is constructed. This system must ingest data from a CRM (Customer data), an ERP (Order processing), and a PIM (Product Information Management) system.The objective is to build a schema that supports high-velocity ingestion and allows the source structures to evolve without breaking the downstream data warehouse. We will focus on the three core entities: Hubs (business keys), Links (relationships), and Satellites (context/attributes).Defining the ArchitectureIn an MPP environment like Snowflake or BigQuery, we avoid using sequences or auto-incrementing integers for surrogate keys because they require a central coordination point, which becomes a bottleneck during parallel loading. Instead, Data Vault 2.0 relies on cryptographic hashes (MD5 or SHA-256) of business keys. This allows deterministic key generation across distributed nodes without looking up existing records.The following diagram outlines the relationship structure we will implement. Notice how the business keys are isolated in Hubs, while the temporal context resides in Satellites.digraph G { rankdir=TB; node [shape=box, style=filled, fontname="Arial", fontsize=10]; edge [fontname="Arial", fontsize=8, color="#868e96"]; /* Hubs - Blue */ subgraph cluster_hubs { label="Hubs (Business Keys)"; style=dashed; color="#adb5bd"; H_CUST [label="HUB_CUSTOMER\n(Email/CustID)", fillcolor="#a5d8ff", color="#1c7ed6"]; H_ORD [label="HUB_ORDER\n(OrderNo)", fillcolor="#a5d8ff", color="#1c7ed6"]; H_PROD [label="HUB_PRODUCT\n(SKU)", fillcolor="#a5d8ff", color="#1c7ed6"]; } /* Links - Violet */ subgraph cluster_links { label="Links (Transactions/Associations)"; style=dashed; color="#adb5bd"; L_ORD_CUST [label="LINK_ORDER_CUSTOMER", fillcolor="#d0bfff", color="#7048e8"]; L_ORD_LINE [label="LINK_ORDER_LINE", fillcolor="#d0bfff", color="#7048e8"]; } /* Satellites - Teal */ subgraph cluster_sats { label="Satellites (Context & History)"; style=dashed; color="#adb5bd"; S_CUST_CRM [label="SAT_CUST_CRM\n(Name, Address)", fillcolor="#96f2d7", color="#0ca678"]; S_ORD_DTL [label="SAT_ORDER_DETAILS\n(Status, Date)", fillcolor="#96f2d7", color="#0ca678"]; } /* Connections */ H_CUST -> L_ORD_CUST; H_ORD -> L_ORD_CUST; H_ORD -> L_ORD_LINE; H_PROD -> L_ORD_LINE; H_CUST -> S_CUST_CRM [style=dotted, label="Descriptive"]; H_ORD -> S_ORD_DTL [style=dotted, label="Descriptive"]; }The diagram separates structural components (Hubs and Links) from descriptive components (Satellites). Hubs connect to Links to form the skeleton, while Satellites attach to Hubs or Links to provide details.Implementing HubsThe Hub is the entry point for a business object. It contains only the Hash Key, the Load Date, the Record Source, and the original Business Key. This table should rarely change.For our Customer entity, the Business Key is the Customer_ID. We generate the Hub_Customer_HK (Hash Key) by hashing the trimmed, upper-cased Business Key.$$H_{key} = SHA256(UPPER(TRIM(BusinessKey)))$$We choose SHA-256 for its low collision probability, which is necessary when processing terabytes of data.SQL Implementation for HUB_CUSTOMER:CREATE TABLE HUB_CUSTOMER ( HUB_CUSTOMER_HK BINARY(32) NOT NULL, -- The Primary Key CUSTOMER_ID VARCHAR(100) NOT NULL, -- The Business Identifier LOAD_DT TIMESTAMP_NTZ NOT NULL, RECORD_SOURCE VARCHAR(50) NOT NULL, CONSTRAINT PK_HUB_CUSTOMER PRIMARY KEY (HUB_CUSTOMER_HK) );When loading this table, you use a DISTINCT select from your staging area. Because the hash is deterministic, you can perform an anti-join (LEFT JOIN ... WHERE target.HK IS NULL) to insert only new business keys.Implementing LinksLinks represent associations or transactions. In our scenario, an order links a customer to a transaction. The LINK_ORDER_CUSTOMER table resolves the many-to-many or one-to-many relationships between HUB_ORDER and HUB_CUSTOMER.A Link table contains its own Hash Key, the Hash Keys of the referenced Hubs, and the audit metadata.SQL Implementation for LINK_ORDER_CUSTOMER:CREATE TABLE LINK_ORDER_CUSTOMER ( LNK_ORD_CUST_HK BINARY(32) NOT NULL, HUB_ORDER_HK BINARY(32) NOT NULL, HUB_CUSTOMER_HK BINARY(32) NOT NULL, LOAD_DT TIMESTAMP_NTZ NOT NULL, RECORD_SOURCE VARCHAR(50) NOT NULL, CONSTRAINT PK_LNK_ORD_CUST PRIMARY KEY (LNK_ORD_CUST_HK), CONSTRAINT FK_LNK_ORD FOREIGN KEY (HUB_ORDER_HK) REFERENCES HUB_ORDER (HUB_ORDER_HK), CONSTRAINT FK_LNK_CUST FOREIGN KEY (HUB_CUSTOMER_HK) REFERENCES HUB_CUSTOMER (HUB_CUSTOMER_HK) );To generate the LNK_ORD_CUST_HK, you concatenate the business keys of the participating Hubs (in a sorted order to ensure consistency) and hash the result.$$H_{link} = SHA256(Sort(Key_{Order}, Key_{Customer}))$$Implementing SatellitesSatellites store the state of data over time. They are the only place where mutable attributes (like customer address or order status) reside. A Satellite must have a parent (usually a Hub) and uses the parent's Hash Key as its primary identifier, combined with the Load Date.To detect changes efficiently in an MPP system, we calculate a HashDiff. This is a hash of all descriptive columns in the row. During the load process, we compare the incoming HashDiff with the most recent HashDiff in the target Satellite. If they differ, a new row is inserted. This eliminates expensive column-by-column comparisons.SQL Implementation for SAT_CUSTOMER_DETAILS:CREATE TABLE SAT_CUSTOMER_CRM ( HUB_CUSTOMER_HK BINARY(32) NOT NULL, LOAD_DT TIMESTAMP_NTZ NOT NULL, HASH_DIFF BINARY(32) NOT NULL, -- Used for Change Detection FIRST_NAME VARCHAR(100), LAST_NAME VARCHAR(100), EMAIL VARCHAR(255), ADDRESS_JSON VARIANT, -- Storing semi-structured address data RECORD_SOURCE VARCHAR(50) NOT NULL, CONSTRAINT PK_SAT_CUST PRIMARY KEY (HUB_CUSTOMER_HK, LOAD_DT), CONSTRAINT FK_SAT_HUB FOREIGN KEY (HUB_CUSTOMER_HK) REFERENCES HUB_CUSTOMER (HUB_CUSTOMER_HK) );The HashDiff CalculationThe HASH_DIFF ensures we only store actual changes. The formula applies to all descriptive columns:$$H_{diff} = SHA256(COALESCE(Col_1, '') || '|' || COALESCE(Col_2, '') || ...)$$We use a delimiter (like |) and handle NULL values to ensure that NULL + A is treated differently than A + NULL.Loading Pattern and ParallelismThe design we just implemented enables a high degree of parallelism.Staging: Raw data lands in a staging table.Hash Generation: We calculate all Hash Keys and Hash Diffs in a view on top of the staging table.Hub/Link Load: We load Hubs and Links in parallel. Since Links only depend on Hub keys (which we generated in staging), we do not technically need to wait for Hubs to finish loading to prepare the Link data, though referential integrity constraints in the database might require the Hub commit first.Satellite Load: Satellites are loaded by comparing the Staging HashDiff against the current Satellite HashDiff.This INSERT-ONLY architecture creates a full audit trail. We never update rows. To get the current state of a customer, you query the Satellite for the row with the maximum LOAD_DT for a specific HUB_CUSTOMER_HK.Visualizing the Storage ImpactIn a columnar storage system, this decomposition (normalization) might seem excessive compared to a wide table. However, MPP systems compress columns individually. The repetition of Hash Keys in Satellites compresses extremely well (Run-Length Encoding), and the separation prevents locking.The following chart displays the storage distribution typically seen in a mature Data Vault. Satellites consume the majority of storage due to historical tracking, while Hubs and Links remain lightweight indexing structures.{"layout": {"title": {"text": "Typical Storage Distribution in Data Vault", "font": {"size": 16}}, "showlegend": true, "legend": {"orientation": "h", "y": -0.1}, "margin": {"t": 40, "b": 20, "l": 40, "r": 40}, "height": 350}, "data": [{"values": [15, 25, 60], "labels": ["Hubs (Keys)", "Links (Relationships)", "Satellites (History & Attributes)"], "type": "pie", "marker": {"colors": ["#1c7ed6", "#7048e8", "#0ca678"]}, "textinfo": "label+percent", "hole": 0.4}]}Satellites inevitably grow the largest as they retain the full history of attribute changes. Hubs and Links stabilize once the core business entities and relationships are established.By implementing this structure, you ensure that if the CRM adds a new field to the customer profile, you simply add a column to the Satellite (or create a new Satellite) without rebuilding the Hubs or Links. This adaptability is the primary engineering advantage of Data Vault in petabyte-scale warehouses.