πŸ›οΈ Why Data Vault?

Data Vault is a modeling methodology invented by Dan Linstedt in the 1990s (formalized 2000, DV 2.0 in 2013). It was designed for large-scale enterprise data warehouse integration where Kimball's approach starts to crack.

😀 Rigid Coupling

Kimball fact tables map to specific business processes. When the business restructures, your entire schema needs surgery.

πŸ”’ Tightly Coupled ETL

Transformations happen before loading. A retroactive business rule change means reprocessing massive volumes.

πŸ” Auditability Gaps

Once data enters a Kimball star, the raw source record is gone. Compliance can't see what System X sent on March 15th.

🚧 Dev Bottlenecks

Multiple teams can't build ETL independently when everything must conform to shared dimensions.

The Three Building Blocks

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ DATA VAULT MODEL β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ HUB │────▢│ LINK │◀────│ HUB β”‚ β”‚ β”‚ β”‚(Business β”‚ β”‚(Relation-β”‚ β”‚ (Business β”‚ β”‚ β”‚ β”‚ Key) β”‚ β”‚ ship) β”‚ β”‚ Key) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚SATELLITE β”‚ β”‚SATELLITE β”‚ β”‚ SATELLITE β”‚ β”‚ β”‚ β”‚(Context/ β”‚ β”‚(Context/ β”‚ β”‚ (Context/ β”‚ β”‚ β”‚ β”‚ History) β”‚ β”‚ History) β”‚ β”‚ History) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ Hubs = WHAT exists (business keys) β”‚ β”‚ Links = HOW things relate β”‚ β”‚ Satellites = WHY / WHEN / descriptive context β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Hub (Business Key)
Satellite (Context)
Hubs = WHAT Β· Links = HOW Β· Satellites = WHEN/WHY
πŸ“– Analogy

Hubs are a passport registry β€” they record a person exists and assign a unique identifier. Links are immigration stamps β€” they record relationships. Satellites are the pages of detail β€” descriptions that version as they change. The registry never changes, relationships accumulate, details are versioned.

πŸ’‘ Interview Insight

"Data Vault separates structure (Hubs and Links) from descriptive context (Satellites), making it agile. Every record carries its load timestamp and source system for full auditability. I'd use it when integrating many source systems where requirements change frequently and auditability is critical."

πŸ”‘ Hubs β€” The Anchor of Identity

A Hub represents a core business concept identified by its business key. It says "this customer exists," "this product exists" β€” nothing more.

Hub Structure
ColumnPurposeExampleMutable?
hash_key (PK)Deterministic hash of BKMD5('CUST-10042')Never
business_keyNatural key from source'CUST-10042'Never
load_dateWhen key first seen2024-01-15 09:30Never
record_sourceWhich system first'CRM_SYSTEM'Never
CREATE TABLE hub_customer ( customer_hk BINARY(16) NOT NULL, -- MD5 hash of customer_bk customer_bk VARCHAR(50) NOT NULL, -- Business key (e.g., email) load_date TIMESTAMP NOT NULL, -- First seen in warehouse record_source VARCHAR(100) NOT NULL, -- Source system identifier CONSTRAINT pk_hub_customer PRIMARY KEY (customer_hk) ); -- Loading a Hub: INSERT only if the business key doesn't exist INSERT INTO hub_customer (customer_hk, customer_bk, load_date, record_source) SELECT MD5(src.customer_email) AS customer_hk, src.customer_email AS customer_bk, CURRENT_TIMESTAMP AS load_date, 'ECOMMERCE_APP' AS record_source FROM staging_customers src WHERE NOT EXISTS ( SELECT 1 FROM hub_customer h WHERE h.customer_hk = MD5(src.customer_email) );

Composite Business Keys

Sometimes a concept requires multiple columns for identity (e.g., chain_code + store_number). Concatenate with a delimiter before hashing to prevent collisions:

-- Composite business key: chain_code + store_number INSERT INTO hub_store (store_hk, chain_code, store_number, load_date, record_source) SELECT MD5(CONCAT(src.chain_code, '||', src.store_number)) AS store_hk, src.chain_code, src.store_number, CURRENT_TIMESTAMP, 'POS_SYSTEM' FROM staging_stores src WHERE NOT EXISTS ( SELECT 1 FROM hub_store h WHERE h.store_hk = MD5(CONCAT(src.chain_code, '||', src.store_number)) );
πŸ“– Analogy

A Hub is like a birth certificate registry. It records that a person was born (exists), their legal name (business key), the date recorded, and which hospital filed it. You don't put height, weight, or address on the birth registry β€” that's descriptive data that changes. The registry is permanent.

πŸ’‘ Interview Insight

"Hubs are immutable because they represent the existence of a business entity. Once registered, that fact doesn't change β€” only descriptive details change, and those go into Satellites. Immutability guarantees referential stability and simplifies loading to pure INSERT-if-not-exists operations."

πŸ”— Links β€” Modeling Relationships

A Link captures a relationship between two or more Hubs. In Data Vault, all relationships are many-to-many by default β€” a fundamental departure from Kimball.

Link Structure
ColumnPurposeExample
link_hk (PK)Hash of combined BKsMD5('CUST-42'+'PROD-99')
hub_1_hk (FK)Hash key β†’ first HubMD5('CUST-42')
hub_2_hk (FK)Hash key β†’ second HubMD5('PROD-99')
load_dateWhen first seen2024-03-10 14:22
record_sourceWhich system reported'ORDER_SYSTEM'
CREATE TABLE link_order ( order_hk BINARY(16) NOT NULL, -- MD5(cust_bk || order_id || prod_bk) customer_hk BINARY(16) NOT NULL, -- FK to hub_customer product_hk BINARY(16) NOT NULL, -- FK to hub_product order_hk_bk VARCHAR(30) NOT NULL, -- Order business key load_date TIMESTAMP NOT NULL, record_source VARCHAR(100) NOT NULL, CONSTRAINT pk_link_order PRIMARY KEY (order_hk) );

Multi-Hub Links

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ hub_customer β”‚ β”‚ hub_product β”‚ β”‚ hub_store β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β” β”‚ link_sale β”‚ ← Customer bought Product at Store β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Handling Relationship Changes

When a customer cancels an order, the Link row stays β€” it records that the relationship once existed. Changes are tracked in a Satellite on the Link:

-- Satellite on the Link to track relationship status changes CREATE TABLE sat_order_details ( order_hk BINARY(16) NOT NULL, -- FK to link_order load_date TIMESTAMP NOT NULL, hash_diff BINARY(16) NOT NULL, -- Hash of descriptive cols record_source VARCHAR(100) NOT NULL, order_status VARCHAR(20), -- 'PLACED', 'SHIPPED', 'CANCELLED' order_amount DECIMAL(12,2), shipping_method VARCHAR(50), CONSTRAINT pk_sat_order PRIMARY KEY (order_hk, load_date) );
πŸ“– Analogy

Links are like marriage certificates. A certificate records that two people entered a relationship. If they divorce, you don't destroy the certificate β€” you file a new document (Satellite record) recording the status change. The historical record is preserved.

πŸ’‘ Interview Insight

"Links are immutable β€” they record that a relationship existed. Changes (cancellations, amendments) are captured in a Satellite attached to the Link. This preserves the full history: we know when the relationship was formed, when it changed status, and by which source system."

πŸ“‘ Satellites β€” The History Keepers

If Hubs answer "WHAT exists?" and Links answer "HOW things relate?", then Satellites answer "WHAT do we know about it, and WHEN did we know it?"

One Satellite Per Source Rule

A critical DV 2.0 rule: one Satellite per source system per Hub or Link.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ hub_customer β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚sat_customer β”‚ β”‚ β”‚ sat_customer β”‚ β”‚ _crm β”‚ β”‚ β”‚ _billing β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ sat_customer β”‚ β”‚ _web_profile β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“Š Different Change Rates

CRM updates monthly, billing updates daily. Mixing them stores redundant copies of unchanged fields on every billing change.

πŸ” Source Auditability

"What did the billing system tell us about this customer on March 15?" β€” answerable without interference from other sources.

⚑ Parallel Loading

Each source loads its own Satellite independently β€” no contention between ETL streams.

CREATE TABLE sat_customer_crm ( customer_hk BINARY(16) NOT NULL, -- FK to hub_customer load_date TIMESTAMP NOT NULL, -- When this version loaded load_end_date TIMESTAMP NULL, -- When next version arrived hash_diff BINARY(16) NOT NULL, -- MD5 of all descriptive cols record_source VARCHAR(100) NOT NULL, first_name VARCHAR(100), last_name VARCHAR(100), email VARCHAR(200), phone VARCHAR(20), loyalty_tier VARCHAR(20), CONSTRAINT pk_sat_cust_crm PRIMARY KEY (customer_hk, load_date) );

Satellites Replace SCD Complexity

Kimball SCD Type 2Data Vault Satellite
Must decide SCD type per column upfrontAll changes captured automatically
Surrogate key changes per versionHub hash key is stable across versions
is_current flag requires UPDATENew row inserted; optional load_end_date
Complex ETL: UPDATE + INSERTINSERT-only (append-only pattern)
Mixed concerns: structure + historyStructure (Hub) / history (Sat) separated
πŸ“– Analogy

Satellites are like medical records. Every doctor visit creates a new entry with current vitals. Nobody erases previous records β€” they stack chronologically. The hash_diff is like the nurse checking "has anything changed since last visit?"

πŸ’‘ Interview Insight

"In Kimball, you decide SCD type per attribute upfront. In Data Vault, Satellites capture all history by default as INSERT-only appends. No current flags, no surrogate key management. The Hub's hash key is stable across all versions, and each source has its own Satellite for natural separation."

πŸ” Hash Keys β€” Parallel Loading Engine

Traditional surrogate keys use auto-incrementing sequences β€” a serial bottleneck. Data Vault 2.0 replaces them with deterministic hash functions:

Source A ──── MD5('CUST-42') ──── INSERT (HK=0x7a3b...) ← Independent! Source B ──── MD5('CUST-42') ──── (same hash, skip!) ← No coordination! Source C ──── MD5('PROD-99') ──── INSERT (HK=0x4f1c...) ← Parallel!

⚑ Parallel Loading

Any ETL process computes hash keys independently β€” no shared state, no locks, no sequence coordination.

🎯 Deterministic Joins

To join Hub to Link or Satellite, you compute the key β€” no lookups. MD5('CUST-42') always equals MD5('CUST-42').

🌐 Distribution-Friendly

Hash keys distribute evenly across database nodes. Sequential integers cause hot-spotting in BigQuery/Snowflake/Redshift.

πŸ”„ Idempotent Loads

Re-running a load with the same source data produces the same hash keys, making loads safely repeatable.

Computing Hash Keys in SQL

-- Single-column business key SELECT MD5(customer_email) AS customer_hk FROM staging_customers; -- Composite key: always use a consistent delimiter SELECT MD5(CONCAT_WS('||', UPPER(TRIM(chain_code)), UPPER(TRIM(store_number)) )) AS store_hk FROM staging_stores; -- Link hash key: hash of all participating business keys SELECT MD5(CONCAT_WS('||', UPPER(TRIM(customer_email)), UPPER(TRIM(order_id)), UPPER(TRIM(product_sku)) )) AS order_hk FROM staging_orders; -- hash_diff for Satellites (change detection) SELECT MD5(CONCAT_WS('||', COALESCE(first_name, ''), COALESCE(last_name, ''), COALESCE(email, ''), COALESCE(phone, '') )) AS hash_diff FROM staging_crm_customers;

Critical Implementation Rules

RuleWhy
Always UPPER + TRIM before hashing' John ' and 'john' should produce the same hash
Use a consistent delimiter like '||'Prevents ('AB','C') and ('A','BC') from colliding
COALESCE NULLs to empty stringMD5(NULL) = NULL, which breaks everything
Choose one hash function, stick with itMD5 everywhere, or SHA-256 everywhere β€” never mix
Document your hashing conventionEveryone on the team must hash the same way

MD5 vs SHA-256

FactorMD5SHA-256
SpeedFaster~30-40% slower
Storage16 bytes32 bytes
Collision riskNegligible for DWEssentially zero
Crypto securityBroken (don't use!)Secure
DV adoptionVery commonGrowing
πŸ’‘ Interview Insight

"Hash keys enable parallel, distributed loading without coordination. Any ETL process can independently compute the same hash from the same business key. This makes loads idempotent and distributes evenly across nodes, avoiding hot-spotting in distributed architectures like BigQuery or Snowflake."

⚑ PIT Tables & Bridge Tables

Data Vault is optimized for loading, not querying. Reconstructing a complete entity at a point in time requires expensive correlated subqueries across multiple Satellites:

-- The SLOW way: correlated subqueries for point-in-time SELECT h.customer_bk, crm.first_name, bill.payment_method FROM hub_customer h LEFT JOIN sat_customer_crm crm ON h.customer_hk = crm.customer_hk AND crm.load_date = ( SELECT MAX(load_date) FROM sat_customer_crm WHERE customer_hk = h.customer_hk AND load_date <= '2024-03-15' ) LEFT JOIN sat_customer_billing bill ON h.customer_hk = bill.customer_hk AND bill.load_date = ( SELECT MAX(load_date) FROM sat_customer_billing WHERE customer_hk = h.customer_hk AND load_date <= '2024-03-15' );

Point-in-Time (PIT) Tables

A PIT table pre-computes which Satellite record is effective for each Hub key at each snapshot date:

CREATE TABLE pit_customer ( customer_hk BINARY(16) NOT NULL, snapshot_date DATE NOT NULL, sat_crm_load_date TIMESTAMP, -- β†’ sat_customer_crm sat_billing_load_date TIMESTAMP, -- β†’ sat_customer_billing sat_web_load_date TIMESTAMP, -- β†’ sat_customer_web CONSTRAINT pk_pit_cust PRIMARY KEY (customer_hk, snapshot_date) ); -- FAST point-in-time query using PIT β†’ pure equi-joins! SELECT h.customer_bk, crm.first_name, crm.last_name, bill.payment_method, bill.credit_score, web.last_login FROM pit_customer pit JOIN hub_customer h ON pit.customer_hk = h.customer_hk LEFT JOIN sat_customer_crm crm ON pit.customer_hk = crm.customer_hk AND pit.sat_crm_load_date = crm.load_date LEFT JOIN sat_customer_billing bill ON pit.customer_hk = bill.customer_hk AND pit.sat_billing_load_date = bill.load_date LEFT JOIN sat_customer_web web ON pit.customer_hk = web.customer_hk AND pit.sat_web_load_date = web.load_date WHERE pit.snapshot_date = '2024-03-15';

The Architecture Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ QUERY FLOW β”‚ β”‚ β”‚ β”‚ Raw Vault Business Vault Info Mart β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ Hubs │───PIT───▢│ PIT Tables │─▢│ Star β”‚ β”‚ β”‚ β”‚ Links │──Bridge─▢│ Bridge Tables│─▢│ Schema β”‚ β”‚ β”‚ β”‚Satellitesβ”‚ β”‚ Calc Sats β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
πŸ“– Analogy

PIT tables are like a library's daily catalog snapshot β€” instead of searching every book's checkout history, check the snapshot which already lists the state of every book. Bridge tables are like "recommended books" cross-reference cards that pre-compute relationships.

πŸ’‘ Interview Insight

"Raw Data Vault is optimized for loading, not querying. For performance, we build PIT tables to eliminate correlated subqueries and Bridge tables to pre-walk Link relationships. Both sit in the Business Vault. For end-user consumption, we build star schema Information Marts β€” giving analysts the familiar Kimball experience backed by Data Vault's auditability."

βš–οΈ Data Vault vs Kimball

This isn't a religious war β€” Data Vault and Kimball solve different problems, and the best architectures often use both.

CriterionKimball Star SchemaData Vault 2.0
Primary purposeReporting & analyticsEnterprise integration
OptimizationQuery performanceLoad agility & audit
SourcesFew (1-5 per mart)Many (10-100+)
Schema changesPainful ALTER + backfillAdd Satellites, no refactor
AuditabilityLimited (transformed)Full (source + timestamp)
HistorySCD Types (complex)Satellite append (simple)
Query complexitySimple (star joins)Complex (needs PIT/Bridge)
End-user accessExcellent (BI-native)Poor (not for direct query)
Parallel devBottlenecked on dimsIndependent team loading

The Layered Architecture: Best of Both Worlds

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ ENTERPRISE DATA WAREHOUSE β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ STAGING β”‚ β”‚ RAW VAULT β”‚ β”‚ BUSINESS VAULT β”‚ β”‚ β”‚ β”‚ LAYER │───▢│ │───▢│ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Hubs β”‚ β”‚ PIT Tables β”‚ β”‚ β”‚ β”‚ Source β”‚ β”‚ Links β”‚ β”‚ Bridge Tables β”‚ β”‚ β”‚ β”‚ data as-is β”‚ β”‚ Satellites β”‚ β”‚ Calc Sats β”‚ β”‚ β”‚ β”‚ temp β”‚ β”‚ No biz rules β”‚ β”‚ Biz rules β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ INFO MART β”‚ β”‚ β”‚ β”‚ (Star Schemas) β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ fact_sales β”‚ β”‚ β”‚ β”‚ dim_customer β”‚ β”‚ β”‚ β”‚ dim_product β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Kimball-style β”‚ β”‚ β”‚ β”‚ for BI tools β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Decision Flowchart

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ How many source systems? β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ 1-5 sources β”‚ β”‚ 10+ sources β–Ό β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Requirements β”‚ β”‚ β”‚ Data Vault + β”‚ β”‚ stable? β”‚ β”‚ β”‚ Star Schema Martsβ”‚ β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ 5-10 sources β”‚ Yes β”‚ No β–Ό β–Ό β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚ Need audit β”‚ β”‚ Kimball β”‚ β”‚ DV + β”‚ trail? β”‚ β”‚ Star Schema β”‚ β”‚ Star β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ Marts Yes/ β”‚ \No └────── β–Ό β–Ό DV + Marts Kimball
πŸ“– Analogy

Data Vault is the vault β€” the secure, fireproof, auditable repository where every document is stored in its original form. Kimball star schemas are the display cases β€” beautifully organized for visitors. You wouldn't let visitors rummage through the vault, and you wouldn't store originals in a display case. You need both.

πŸ’‘ Interview Insight

"It's not either-or. Data Vault excels at the integration layer with many source systems and regulatory needs. Kimball excels at presentation β€” star schemas are intuitive for BI tools. The pattern: Raw Vault for ingestion, Business Vault for business rules with PIT and Bridge tables, and Kimball-style Information Marts for consumption. Kimball-only for small, stable projects. Add Data Vault when integrating 5-10+ sources or when compliance requires full lineage."

πŸ§ͺ Quiz: Test Your Knowledge

Q1: What is the primary purpose of a Hub in Data Vault?

Q2: Why does Data Vault 2.0 use hash keys instead of auto-incrementing surrogates?

Q3: A customer's phone number changes. In Data Vault, what happens?

Q4: Why does Data Vault recommend one Satellite per source system per Hub?

Q5: What problem do Point-in-Time (PIT) tables solve?

Q6: In Data Vault architecture, where do business rules get applied?

Q7: Which architecture fits an enterprise with 15+ sources, regulatory audits, and a Tableau BI team?

🎯 Key Takeaways

1. Structure β‰  Content

Hubs & Links define what exists and how things relate (structure). Satellites define what we know (content). This separation = agility.

2. Hubs Are Immutable

Identity registries for business entities. Once registered, never changed. All descriptions and history β†’ Satellites.

3. Links = Many-to-Many

All relationships modeled as M:N from the start. Status changes tracked in Link Satellites. No painful refactoring later.

4. INSERT-Only History

Satellites capture all history as appends. No UPDATEs, no current flags, no SCD complexity. One Satellite per source per Hub/Link.

5. Hash Keys = Parallelism

Deterministic hashing replaces sequences. Independent parallel loading, distributed joins, idempotent ops. Always UPPER, TRIM, COALESCE before hashing.

6. PIT + Bridge = Performance

PIT tables pre-compute temporal lookups. Bridge tables pre-walk Link relationships. Both sit in the Business Vault layer.

7. DV + Kimball = Best of Both

Raw Vault β†’ Business Vault for integration & audit. Kimball star schemas (Info Marts) for BI consumption. Raw Vault never changes.

8. Know When NOT to Use DV

Small projects, few sources, stable requirements β†’ Kimball-only is simpler. DV shines at 10+ sources with regulatory needs.