Data Vault is a modeling methodology invented by Dan Linstedt in the 1990s (formalized 2000, DV 2.0 in 2013). It was designed for large-scale enterprise data warehouse integration where Kimball's approach starts to crack.
π€ Rigid Coupling
Kimball fact tables map to specific business processes. When the business restructures, your entire schema needs surgery.
π Tightly Coupled ETL
Transformations happen before loading. A retroactive business rule change means reprocessing massive volumes.
π Auditability Gaps
Once data enters a Kimball star, the raw source record is gone. Compliance can't see what System X sent on March 15th.
π§ Dev Bottlenecks
Multiple teams can't build ETL independently when everything must conform to shared dimensions.
Hubs = WHAT Β· Links = HOW Β· Satellites = WHEN/WHY
π Analogy
Hubs are a passport registry β they record a person exists and assign a unique identifier. Links are immigration stamps β they record relationships. Satellites are the pages of detail β descriptions that version as they change. The registry never changes, relationships accumulate, details are versioned.
π‘ Interview Insight
"Data Vault separates structure (Hubs and Links) from descriptive context (Satellites), making it agile. Every record carries its load timestamp and source system for full auditability. I'd use it when integrating many source systems where requirements change frequently and auditability is critical."
π Hubs β The Anchor of Identity
A Hub represents a core business concept identified by its business key. It says "this customer exists," "this product exists" β nothing more.
Hub Structure
Column
Purpose
Example
Mutable?
hash_key (PK)
Deterministic hash of BK
MD5('CUST-10042')
Never
business_key
Natural key from source
'CUST-10042'
Never
load_date
When key first seen
2024-01-15 09:30
Never
record_source
Which system first
'CRM_SYSTEM'
Never
CREATE TABLE hub_customer (
customer_hk BINARY(16) NOT NULL, -- MD5 hash of customer_bk
customer_bk VARCHAR(50) NOT NULL, -- Business key (e.g., email)
load_date TIMESTAMPNOT NULL, -- First seen in warehouse
record_source VARCHAR(100) NOT NULL, -- Source system identifierCONSTRAINT pk_hub_customer PRIMARY KEY (customer_hk)
);
-- Loading a Hub: INSERT only if the business key doesn't existINSERT INTO hub_customer (customer_hk, customer_bk, load_date, record_source)
SELECTMD5(src.customer_email) AS customer_hk,
src.customer_email AS customer_bk,
CURRENT_TIMESTAMPAS load_date,
'ECOMMERCE_APP'AS record_source
FROM staging_customers src
WHERE NOT EXISTS (
SELECT1FROM hub_customer h
WHERE h.customer_hk = MD5(src.customer_email)
);
Composite Business Keys
Sometimes a concept requires multiple columns for identity (e.g., chain_code + store_number). Concatenate with a delimiter before hashing to prevent collisions:
-- Composite business key: chain_code + store_numberINSERT INTO hub_store (store_hk, chain_code, store_number, load_date, record_source)
SELECTMD5(CONCAT(src.chain_code, '||', src.store_number)) AS store_hk,
src.chain_code, src.store_number, CURRENT_TIMESTAMP, 'POS_SYSTEM'FROM staging_stores src
WHERE NOT EXISTS (
SELECT1FROM hub_store h
WHERE h.store_hk = MD5(CONCAT(src.chain_code, '||', src.store_number))
);
π Analogy
A Hub is like a birth certificate registry. It records that a person was born (exists), their legal name (business key), the date recorded, and which hospital filed it. You don't put height, weight, or address on the birth registry β that's descriptive data that changes. The registry is permanent.
π‘ Interview Insight
"Hubs are immutable because they represent the existence of a business entity. Once registered, that fact doesn't change β only descriptive details change, and those go into Satellites. Immutability guarantees referential stability and simplifies loading to pure INSERT-if-not-exists operations."
π Links β Modeling Relationships
A Link captures a relationship between two or more Hubs. In Data Vault, all relationships are many-to-many by default β a fundamental departure from Kimball.
Link Structure
Column
Purpose
Example
link_hk (PK)
Hash of combined BKs
MD5('CUST-42'+'PROD-99')
hub_1_hk (FK)
Hash key β first Hub
MD5('CUST-42')
hub_2_hk (FK)
Hash key β second Hub
MD5('PROD-99')
load_date
When first seen
2024-03-10 14:22
record_source
Which system reported
'ORDER_SYSTEM'
CREATE TABLE link_order (
order_hk BINARY(16) NOT NULL, -- MD5(cust_bk || order_id || prod_bk)
customer_hk BINARY(16) NOT NULL, -- FK to hub_customer
product_hk BINARY(16) NOT NULL, -- FK to hub_product
order_hk_bk VARCHAR(30) NOT NULL, -- Order business key
load_date TIMESTAMPNOT NULL,
record_source VARCHAR(100) NOT NULL,
CONSTRAINT pk_link_order PRIMARY KEY (order_hk)
);
When a customer cancels an order, the Link row stays β it records that the relationship once existed. Changes are tracked in a Satellite on the Link:
-- Satellite on the Link to track relationship status changesCREATE TABLE sat_order_details (
order_hk BINARY(16) NOT NULL, -- FK to link_order
load_date TIMESTAMPNOT NULL,
hash_diff BINARY(16) NOT NULL, -- Hash of descriptive cols
record_source VARCHAR(100) NOT NULL,
order_status VARCHAR(20), -- 'PLACED', 'SHIPPED', 'CANCELLED'
order_amount DECIMAL(12,2),
shipping_method VARCHAR(50),
CONSTRAINT pk_sat_order PRIMARY KEY (order_hk, load_date)
);
π Analogy
Links are like marriage certificates. A certificate records that two people entered a relationship. If they divorce, you don't destroy the certificate β you file a new document (Satellite record) recording the status change. The historical record is preserved.
π‘ Interview Insight
"Links are immutable β they record that a relationship existed. Changes (cancellations, amendments) are captured in a Satellite attached to the Link. This preserves the full history: we know when the relationship was formed, when it changed status, and by which source system."
π‘ Satellites β The History Keepers
If Hubs answer "WHAT exists?" and Links answer "HOW things relate?", then Satellites answer "WHAT do we know about it, and WHEN did we know it?"
One Satellite Per Source Rule
A critical DV 2.0 rule: one Satellite per source system per Hub or Link.
CRM updates monthly, billing updates daily. Mixing them stores redundant copies of unchanged fields on every billing change.
π Source Auditability
"What did the billing system tell us about this customer on March 15?" β answerable without interference from other sources.
β‘ Parallel Loading
Each source loads its own Satellite independently β no contention between ETL streams.
CREATE TABLE sat_customer_crm (
customer_hk BINARY(16) NOT NULL, -- FK to hub_customer
load_date TIMESTAMPNOT NULL, -- When this version loaded
load_end_date TIMESTAMPNULL, -- When next version arrived
hash_diff BINARY(16) NOT NULL, -- MD5 of all descriptive cols
record_source VARCHAR(100) NOT NULL,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(200),
phone VARCHAR(20),
loyalty_tier VARCHAR(20),
CONSTRAINT pk_sat_cust_crm PRIMARY KEY (customer_hk, load_date)
);
Satellites Replace SCD Complexity
Kimball SCD Type 2
Data Vault Satellite
Must decide SCD type per column upfront
All changes captured automatically
Surrogate key changes per version
Hub hash key is stable across versions
is_current flag requires UPDATE
New row inserted; optional load_end_date
Complex ETL: UPDATE + INSERT
INSERT-only (append-only pattern)
Mixed concerns: structure + history
Structure (Hub) / history (Sat) separated
π Analogy
Satellites are like medical records. Every doctor visit creates a new entry with current vitals. Nobody erases previous records β they stack chronologically. The hash_diff is like the nurse checking "has anything changed since last visit?"
π‘ Interview Insight
"In Kimball, you decide SCD type per attribute upfront. In Data Vault, Satellites capture all history by default as INSERT-only appends. No current flags, no surrogate key management. The Hub's hash key is stable across all versions, and each source has its own Satellite for natural separation."
π Hash Keys β Parallel Loading Engine
Traditional surrogate keys use auto-incrementing sequences β a serial bottleneck. Data Vault 2.0 replaces them with deterministic hash functions:
Source A ββββ MD5('CUST-42') ββββ INSERT (HK=0x7a3b...) β Independent!
Source B ββββ MD5('CUST-42') ββββ (same hash, skip!) β No coordination!
Source C ββββ MD5('PROD-99') ββββ INSERT (HK=0x4f1c...) β Parallel!
β‘ Parallel Loading
Any ETL process computes hash keys independently β no shared state, no locks, no sequence coordination.
π― Deterministic Joins
To join Hub to Link or Satellite, you compute the key β no lookups. MD5('CUST-42') always equals MD5('CUST-42').
π Distribution-Friendly
Hash keys distribute evenly across database nodes. Sequential integers cause hot-spotting in BigQuery/Snowflake/Redshift.
π Idempotent Loads
Re-running a load with the same source data produces the same hash keys, making loads safely repeatable.
Computing Hash Keys in SQL
-- Single-column business keySELECTMD5(customer_email) AS customer_hk
FROM staging_customers;
-- Composite key: always use a consistent delimiterSELECTMD5(CONCAT_WS('||',
UPPER(TRIM(chain_code)),
UPPER(TRIM(store_number))
)) AS store_hk
FROM staging_stores;
-- Link hash key: hash of all participating business keysSELECTMD5(CONCAT_WS('||',
UPPER(TRIM(customer_email)),
UPPER(TRIM(order_id)),
UPPER(TRIM(product_sku))
)) AS order_hk
FROM staging_orders;
-- hash_diff for Satellites (change detection)SELECTMD5(CONCAT_WS('||',
COALESCE(first_name, ''),
COALESCE(last_name, ''),
COALESCE(email, ''),
COALESCE(phone, '')
)) AS hash_diff
FROM staging_crm_customers;
Critical Implementation Rules
Rule
Why
Always UPPER + TRIM before hashing
' John ' and 'john' should produce the same hash
Use a consistent delimiter like '||'
Prevents ('AB','C') and ('A','BC') from colliding
COALESCE NULLs to empty string
MD5(NULL) = NULL, which breaks everything
Choose one hash function, stick with it
MD5 everywhere, or SHA-256 everywhere β never mix
Document your hashing convention
Everyone on the team must hash the same way
MD5 vs SHA-256
Factor
MD5
SHA-256
Speed
Faster
~30-40% slower
Storage
16 bytes
32 bytes
Collision risk
Negligible for DW
Essentially zero
Crypto security
Broken (don't use!)
Secure
DV adoption
Very common
Growing
π‘ Interview Insight
"Hash keys enable parallel, distributed loading without coordination. Any ETL process can independently compute the same hash from the same business key. This makes loads idempotent and distributes evenly across nodes, avoiding hot-spotting in distributed architectures like BigQuery or Snowflake."
β‘ PIT Tables & Bridge Tables
Data Vault is optimized for loading, not querying. Reconstructing a complete entity at a point in time requires expensive correlated subqueries across multiple Satellites:
-- The SLOW way: correlated subqueries for point-in-timeSELECT h.customer_bk, crm.first_name, bill.payment_method
FROM hub_customer h
LEFT JOIN sat_customer_crm crm
ON h.customer_hk = crm.customer_hk
AND crm.load_date = (
SELECTMAX(load_date) FROM sat_customer_crm
WHERE customer_hk = h.customer_hk
AND load_date <= '2024-03-15'
)
LEFT JOIN sat_customer_billing bill
ON h.customer_hk = bill.customer_hk
AND bill.load_date = (
SELECTMAX(load_date) FROM sat_customer_billing
WHERE customer_hk = h.customer_hk
AND load_date <= '2024-03-15'
);
Point-in-Time (PIT) Tables
A PIT table pre-computes which Satellite record is effective for each Hub key at each snapshot date:
CREATE TABLE pit_customer (
customer_hk BINARY(16) NOT NULL,
snapshot_date DATENOT NULL,
sat_crm_load_date TIMESTAMP, -- β sat_customer_crm
sat_billing_load_date TIMESTAMP, -- β sat_customer_billing
sat_web_load_date TIMESTAMP, -- β sat_customer_webCONSTRAINT pk_pit_cust PRIMARY KEY (customer_hk, snapshot_date)
);
-- FAST point-in-time query using PIT β pure equi-joins!SELECT
h.customer_bk,
crm.first_name, crm.last_name,
bill.payment_method, bill.credit_score,
web.last_login
FROM pit_customer pit
JOIN hub_customer h ON pit.customer_hk = h.customer_hk
LEFT JOIN sat_customer_crm crm
ON pit.customer_hk = crm.customer_hk
AND pit.sat_crm_load_date = crm.load_date
LEFT JOIN sat_customer_billing bill
ON pit.customer_hk = bill.customer_hk
AND pit.sat_billing_load_date = bill.load_date
LEFT JOIN sat_customer_web web
ON pit.customer_hk = web.customer_hk
AND pit.sat_web_load_date = web.load_date
WHERE pit.snapshot_date = '2024-03-15';
PIT tables are like a library's daily catalog snapshot β instead of searching every book's checkout history, check the snapshot which already lists the state of every book. Bridge tables are like "recommended books" cross-reference cards that pre-compute relationships.
π‘ Interview Insight
"Raw Data Vault is optimized for loading, not querying. For performance, we build PIT tables to eliminate correlated subqueries and Bridge tables to pre-walk Link relationships. Both sit in the Business Vault. For end-user consumption, we build star schema Information Marts β giving analysts the familiar Kimball experience backed by Data Vault's auditability."
βοΈ Data Vault vs Kimball
This isn't a religious war β Data Vault and Kimball solve different problems, and the best architectures often use both.
βββββββββββββββββββββββββββββ
β How many source systems? β
βββββββββββββββ¬ββββββββββββββ
β
βββββββββββββββββββΌββββββββββββββββββ
β 1-5 sources β β 10+ sources
βΌ β βΌ
βββββββββββββββββ β ββββββββββββββββββββ
β Requirements β β β Data Vault + β
β stable? β β β Star Schema Martsβ
βββββββββ¬ββββββββ β ββββββββββββββββββββ
β β
βββββββββββΌββββββββββ β 5-10 sources
β Yes β No βΌ
βΌ βΌ βββββββββββββββββ
βββββββββββββββ ββββββββββββ Need audit β
β Kimball β β DV + β trail? β
β Star Schema β β Star βββββββββ¬ββββββββ
βββββββββββββββ β Marts Yes/ β \No
βββββββ βΌ βΌ
DV + Marts Kimball
π Analogy
Data Vault is the vault β the secure, fireproof, auditable repository where every document is stored in its original form. Kimball star schemas are the display cases β beautifully organized for visitors. You wouldn't let visitors rummage through the vault, and you wouldn't store originals in a display case. You need both.
π‘ Interview Insight
"It's not either-or. Data Vault excels at the integration layer with many source systems and regulatory needs. Kimball excels at presentation β star schemas are intuitive for BI tools. The pattern: Raw Vault for ingestion, Business Vault for business rules with PIT and Bridge tables, and Kimball-style Information Marts for consumption. Kimball-only for small, stable projects. Add Data Vault when integrating 5-10+ sources or when compliance requires full lineage."
π§ͺ Quiz: Test Your Knowledge
Q1: What is the primary purpose of a Hub in Data Vault?
Q2: Why does Data Vault 2.0 use hash keys instead of auto-incrementing surrogates?
Q3: A customer's phone number changes. In Data Vault, what happens?
Q4: Why does Data Vault recommend one Satellite per source system per Hub?
Q5: What problem do Point-in-Time (PIT) tables solve?
Q6: In Data Vault architecture, where do business rules get applied?
Q7: Which architecture fits an enterprise with 15+ sources, regulatory audits, and a Tableau BI team?
π― Key Takeaways
1. Structure β Content
Hubs & Links define what exists and how things relate (structure). Satellites define what we know (content). This separation = agility.
2. Hubs Are Immutable
Identity registries for business entities. Once registered, never changed. All descriptions and history β Satellites.
3. Links = Many-to-Many
All relationships modeled as M:N from the start. Status changes tracked in Link Satellites. No painful refactoring later.
4. INSERT-Only History
Satellites capture all history as appends. No UPDATEs, no current flags, no SCD complexity. One Satellite per source per Hub/Link.