A Data Vault model requires a fundamental shift in how relationships and data storage are perceived in a warehouse. Unlike the star schema, which pre-joins data for read optimization, Data Vault optimizes for write flexibility and auditability. In this section, we construct a Raw Vault for an e-commerce platform that must ingest data from a CRM (customer data), an ERP (order processing), and a PIM (Product Information Management) system.
The objective is to build a schema that supports high-velocity ingestion and allows the source structures to evolve without breaking the downstream data warehouse. We will focus on the three core entities: Hubs (business keys), Links (relationships), and Satellites (context/attributes).
In an MPP environment like Snowflake or BigQuery, we avoid using sequences or auto-incrementing integers for surrogate keys because they require a central coordination point, which becomes a bottleneck during parallel loading. Instead, Data Vault 2.0 relies on cryptographic hashes (MD5 or SHA-256) of business keys. This allows deterministic key generation across distributed nodes without looking up existing records.
The following diagram outlines the relationship structure we will implement. Notice how the business keys are isolated in Hubs, while the temporal context resides in Satellites.
The diagram separates structural components (Hubs and Links) from descriptive components (Satellites). Hubs connect to Links to form the skeleton, while Satellites attach to Hubs or Links to provide details.
The Hub is the entry point for a business object. It contains only the Hash Key, the Load Date, the Record Source, and the original Business Key. This table should rarely change.
For our Customer entity, the business key is CUSTOMER_ID. We generate HUB_CUSTOMER_HK (the hash key) by hashing the trimmed, upper-cased business key.
We choose SHA-256 for its low collision probability, which is necessary when processing terabytes of data.
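As a minimal sketch, the derivation in Snowflake can use SHA2_BINARY; the staging table STG_CRM_CUSTOMER is a hypothetical name used here for illustration:
SELECT
    SHA2_BINARY(UPPER(TRIM(CUSTOMER_ID)), 256) AS HUB_CUSTOMER_HK, -- deterministic: same key on every node
    CUSTOMER_ID
FROM STG_CRM_CUSTOMER;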
SQL Implementation for HUB_CUSTOMER:
CREATE TABLE HUB_CUSTOMER (
    HUB_CUSTOMER_HK BINARY(32)    NOT NULL, -- The Primary Key
    CUSTOMER_ID     VARCHAR(100)  NOT NULL, -- The Business Identifier
    LOAD_DT         TIMESTAMP_NTZ NOT NULL,
    RECORD_SOURCE   VARCHAR(50)   NOT NULL,
    CONSTRAINT PK_HUB_CUSTOMER PRIMARY KEY (HUB_CUSTOMER_HK)
);
When loading this table, select DISTINCT business keys from your staging area. Because the hash is deterministic, you can perform an anti-join (LEFT JOIN ... WHERE target.HK IS NULL) to insert only business keys that do not yet exist in the Hub.
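A sketch of this load pattern, again assuming the hypothetical staging table STG_CRM_CUSTOMER:
INSERT INTO HUB_CUSTOMER (HUB_CUSTOMER_HK, CUSTOMER_ID, LOAD_DT, RECORD_SOURCE)
SELECT DISTINCT
    SHA2_BINARY(UPPER(TRIM(stg.CUSTOMER_ID)), 256),
    stg.CUSTOMER_ID,
    CURRENT_TIMESTAMP()::TIMESTAMP_NTZ,
    'CRM'
FROM STG_CRM_CUSTOMER stg
LEFT JOIN HUB_CUSTOMER hub
       ON hub.HUB_CUSTOMER_HK = SHA2_BINARY(UPPER(TRIM(stg.CUSTOMER_ID)), 256)
WHERE hub.HUB_CUSTOMER_HK IS NULL; -- anti-join: keep only unseen business keys
DISTINCT collapses duplicate keys within the batch, while the anti-join filters out keys already present in the Hub.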
Links represent associations or transactions. In our scenario, every order is placed by a customer, so the LINK_ORDER_CUSTOMER table resolves the many-to-many or one-to-many relationships between HUB_ORDER and HUB_CUSTOMER.
A Link table contains its own Hash Key, the Hash Keys of the referenced Hubs, and the audit metadata.
SQL Implementation for LINK_ORDER_CUSTOMER:
CREATE TABLE LINK_ORDER_CUSTOMER (
    LNK_ORD_CUST_HK BINARY(32)    NOT NULL,
    HUB_ORDER_HK    BINARY(32)    NOT NULL,
    HUB_CUSTOMER_HK BINARY(32)    NOT NULL,
    LOAD_DT         TIMESTAMP_NTZ NOT NULL,
    RECORD_SOURCE   VARCHAR(50)   NOT NULL,
    CONSTRAINT PK_LNK_ORD_CUST PRIMARY KEY (LNK_ORD_CUST_HK),
    CONSTRAINT FK_LNK_ORD FOREIGN KEY (HUB_ORDER_HK) REFERENCES HUB_ORDER (HUB_ORDER_HK),
    CONSTRAINT FK_LNK_CUST FOREIGN KEY (HUB_CUSTOMER_HK) REFERENCES HUB_CUSTOMER (HUB_CUSTOMER_HK)
);
To generate the LNK_ORD_CUST_HK, you concatenate the business keys of the participating Hubs (in a sorted order to ensure consistency) and hash the result.
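A sketch of the key derivation, assuming a hypothetical staging table STG_ERP_ORDER carrying ORDER_ID and CUSTOMER_ID:
SELECT
    -- Link key: hash of the concatenated business keys in a fixed, agreed order
    SHA2_BINARY(UPPER(TRIM(ORDER_ID)) || '|' || UPPER(TRIM(CUSTOMER_ID)), 256) AS LNK_ORD_CUST_HK,
    SHA2_BINARY(UPPER(TRIM(ORDER_ID)), 256)    AS HUB_ORDER_HK,
    SHA2_BINARY(UPPER(TRIM(CUSTOMER_ID)), 256) AS HUB_CUSTOMER_HK
FROM STG_ERP_ORDER;
The same expression must be used everywhere the link key is derived; otherwise identical relationships will hash to different keys.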
Satellites store the state of data over time. They are the only place where mutable attributes (like customer address or order status) reside. A Satellite must have a parent (usually a Hub) and uses the parent's Hash Key as its primary identifier, combined with the Load Date.
To detect changes efficiently in an MPP system, we calculate a HashDiff. This is a hash of all descriptive columns in the row. During the load process, we compare the incoming HashDiff with the most recent HashDiff in the target Satellite. If they differ, a new row is inserted. This eliminates expensive column-by-column comparisons.
SQL Implementation for SAT_CUSTOMER_CRM (the Satellite is named after its source system):
CREATE TABLE SAT_CUSTOMER_CRM (
    HUB_CUSTOMER_HK BINARY(32)    NOT NULL,
    LOAD_DT         TIMESTAMP_NTZ NOT NULL,
    HASH_DIFF       BINARY(32)    NOT NULL, -- Used for Change Detection
    FIRST_NAME      VARCHAR(100),
    LAST_NAME       VARCHAR(100),
    EMAIL           VARCHAR(255),
    ADDRESS_JSON    VARIANT,                -- Storing semi-structured address data
    RECORD_SOURCE   VARCHAR(50)   NOT NULL,
    CONSTRAINT PK_SAT_CUST PRIMARY KEY (HUB_CUSTOMER_HK, LOAD_DT),
    CONSTRAINT FK_SAT_HUB FOREIGN KEY (HUB_CUSTOMER_HK) REFERENCES HUB_CUSTOMER (HUB_CUSTOMER_HK)
);
The HASH_DIFF ensures we only store actual changes. It is computed over all descriptive columns: each value is cast to text, NULLs are replaced with a fixed sentinel (such as an empty string), and the results are concatenated with a delimiter (like |) before hashing. The delimiter preserves column positions, so a NULL followed by 'A' hashes differently than 'A' followed by a NULL.
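Putting both ideas together, here is a sketch of a Satellite delta load; the staging table STG_CRM_CUSTOMER and its columns are assumed for illustration:
INSERT INTO SAT_CUSTOMER_CRM (
    HUB_CUSTOMER_HK, LOAD_DT, HASH_DIFF,
    FIRST_NAME, LAST_NAME, EMAIL, ADDRESS_JSON, RECORD_SOURCE
)
WITH staged AS (
    SELECT
        SHA2_BINARY(UPPER(TRIM(CUSTOMER_ID)), 256) AS HUB_CUSTOMER_HK,
        -- HashDiff over all descriptive columns, delimited and NULL-safe
        SHA2_BINARY(
            COALESCE(FIRST_NAME, '') || '|' ||
            COALESCE(LAST_NAME, '')  || '|' ||
            COALESCE(EMAIL, '')      || '|' ||
            COALESCE(ADDRESS_JSON::VARCHAR, ''),
            256
        ) AS HASH_DIFF,
        FIRST_NAME, LAST_NAME, EMAIL, ADDRESS_JSON
    FROM STG_CRM_CUSTOMER
),
latest AS (
    -- Most recent row per customer currently in the Satellite
    SELECT HUB_CUSTOMER_HK, HASH_DIFF
    FROM SAT_CUSTOMER_CRM
    QUALIFY ROW_NUMBER() OVER (PARTITION BY HUB_CUSTOMER_HK ORDER BY LOAD_DT DESC) = 1
)
SELECT
    s.HUB_CUSTOMER_HK,
    CURRENT_TIMESTAMP()::TIMESTAMP_NTZ,
    s.HASH_DIFF,
    s.FIRST_NAME, s.LAST_NAME, s.EMAIL, s.ADDRESS_JSON,
    'CRM'
FROM staged s
LEFT JOIN latest l ON l.HUB_CUSTOMER_HK = s.HUB_CUSTOMER_HK
WHERE l.HUB_CUSTOMER_HK IS NULL   -- brand-new customer
   OR l.HASH_DIFF <> s.HASH_DIFF; -- attributes changed since the last load
QUALIFY is Snowflake-specific; on other engines the same filter can be written with a subquery over ROW_NUMBER().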
The design we just implemented enables a high degree of parallelism.
Hubs, Links, and Satellites can all be loaded concurrently: Hub and Link loads are simple anti-join inserts, and a Satellite load only compares each incoming HashDiff against the current Satellite HashDiff. This INSERT-ONLY architecture creates a full audit trail; we never update rows. To get the current state of a customer, you query the Satellite for the row with the maximum LOAD_DT for a specific HUB_CUSTOMER_HK.
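For example, the latest Satellite row per customer can be retrieved with a window function (Snowflake's QUALIFY shown here):
SELECT *
FROM SAT_CUSTOMER_CRM
-- add a WHERE on HUB_CUSTOMER_HK to target a single customer
QUALIFY ROW_NUMBER() OVER (PARTITION BY HUB_CUSTOMER_HK ORDER BY LOAD_DT DESC) = 1;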
In a columnar storage system, this decomposition (normalization) might seem excessive compared to a wide table. However, MPP systems compress columns individually: the repetition of Hash Keys in Satellites compresses extremely well (Run-Length Encoding), and the separation keeps parallel loads from contending for the same table.
The following chart displays the storage distribution typically seen in a mature Data Vault. Satellites consume the majority of storage due to historical tracking, while Hubs and Links remain lightweight indexing structures.
Satellites inevitably grow the largest as they retain the full history of attribute changes. Hubs and Links stabilize once the core business entities and relationships are established.
By implementing this structure, you ensure that if the CRM adds a new field to the customer profile, you simply add a column to the Satellite (or create a new Satellite) without rebuilding the Hubs or Links. This adaptability is the primary engineering advantage of Data Vault in petabyte-scale warehouses.