Module 7: Data Vault 2.0 | Data Modeling Interview Mastery

🏛️ Why Data Vault?

Data Vault is a modeling methodology invented by Dan Linstedt in the 1990s (formalized 2000, DV 2.0 in 2013). It was designed for large-scale enterprise data warehouse integration where Kimball's approach starts to crack.

😤 Rigid Coupling

Kimball fact tables map to specific business processes. When the business restructures, your entire schema needs surgery.

🔒 Tightly Coupled ETL

Transformations happen before loading. A retroactive business rule change means reprocessing massive volumes.

🔍 Auditability Gaps

Once data enters a Kimball star, the raw source record is gone. Compliance can't see what System X sent on March 15th.

🚧 Dev Bottlenecks

Multiple teams can't build ETL independently when everything must conform to shared dimensions.

The Three Building Blocks

┌──────────────────────────────────────────────────────────┐
│                    DATA VAULT MODEL                       │
│                                                          │
│   ┌──────────┐     ┌──────────┐     ┌──────────────┐    │
│   │   HUB    │────▶│   LINK   │◀────│     HUB      │    │
│   │(Business │     │(Relation-│     │  (Business   │    │
│   │   Key)   │     │   ship)  │     │    Key)      │    │
│   └────┬─────┘     └────┬─────┘     └──────┬───────┘    │
│        │                │                   │            │
│   ┌────▼─────┐     ┌────▼─────┐     ┌──────▼───────┐    │
│   │SATELLITE │     │SATELLITE │     │  SATELLITE   │    │
│   │(Context/ │     │(Context/ │     │  (Context/   │    │
│   │ History) │     │ History) │     │   History)   │    │
│   └──────────┘     └──────────┘     └──────────────┘    │
│                                                          │
│   Hubs = WHAT exists (business keys)                     │
│   Links = HOW things relate                              │
│   Satellites = WHY / WHEN / descriptive context          │
└──────────────────────────────────────────────────────────┘

Hub (Business Key)

Link (Relationship)

Satellite (Context)

Hubs = WHAT · Links = HOW · Satellites = WHEN/WHY

📖 Analogy

Hubs are a passport registry — they record a person exists and assign a unique identifier. Links are immigration stamps — they record relationships. Satellites are the pages of detail — descriptions that version as they change. The registry never changes, relationships accumulate, details are versioned.

💡 Interview Insight

"Data Vault separates structure (Hubs and Links) from descriptive context (Satellites), making it agile. Every record carries its load timestamp and source system for full auditability. I'd use it when integrating many source systems where requirements change frequently and auditability is critical."

🔑 Hubs — The Anchor of Identity

A Hub represents a core business concept identified by its business key. It says "this customer exists," "this product exists" — nothing more.

Hub Structure
Column	Purpose	Example	Mutable?
hash_key (PK)	Deterministic hash of BK	MD5('CUST-10042')	Never
business_key	Natural key from source	'CUST-10042'	Never
load_date	When key first seen	2024-01-15 09:30	Never
record_source	Which system first	'CRM_SYSTEM'	Never

CREATE TABLE hub_customer (
    customer_hk        BINARY(16)      NOT NULL,  -- MD5 hash of customer_bk
    customer_bk        VARCHAR(50)     NOT NULL,  -- Business key (e.g., email)
    load_date          TIMESTAMP       NOT NULL,  -- First seen in warehouse
    record_source      VARCHAR(100)    NOT NULL,  -- Source system identifier

    CONSTRAINT pk_hub_customer PRIMARY KEY (customer_hk)
);

-- Loading a Hub: INSERT only if the business key doesn't exist
INSERT INTO hub_customer (customer_hk, customer_bk, load_date, record_source)
SELECT
    MD5(src.customer_email)         AS customer_hk,
    src.customer_email              AS customer_bk,
    CURRENT_TIMESTAMP               AS load_date,
    'ECOMMERCE_APP'                 AS record_source
FROM staging_customers src
WHERE NOT EXISTS (
    SELECT 1 FROM hub_customer h
    WHERE h.customer_hk = MD5(src.customer_email)
);

Composite Business Keys

Sometimes a concept requires multiple columns for identity (e.g., chain_code + store_number). Concatenate with a delimiter before hashing to prevent collisions:

-- Composite business key: chain_code + store_number
INSERT INTO hub_store (store_hk, chain_code, store_number, load_date, record_source)
SELECT
    MD5(CONCAT(src.chain_code, '||', src.store_number))  AS store_hk,
    src.chain_code, src.store_number, CURRENT_TIMESTAMP, 'POS_SYSTEM'
FROM staging_stores src
WHERE NOT EXISTS (
    SELECT 1 FROM hub_store h
    WHERE h.store_hk = MD5(CONCAT(src.chain_code, '||', src.store_number))
);

📖 Analogy

A Hub is like a birth certificate registry. It records that a person was born (exists), their legal name (business key), the date recorded, and which hospital filed it. You don't put height, weight, or address on the birth registry — that's descriptive data that changes. The registry is permanent.

💡 Interview Insight

"Hubs are immutable because they represent the existence of a business entity. Once registered, that fact doesn't change — only descriptive details change, and those go into Satellites. Immutability guarantees referential stability and simplifies loading to pure INSERT-if-not-exists operations."

🔗 Links — Modeling Relationships

A Link captures a relationship between two or more Hubs. In Data Vault, all relationships are many-to-many by default — a fundamental departure from Kimball.

Link Structure
Column	Purpose	Example
link_hk (PK)	Hash of combined BKs	MD5('CUST-42'+'PROD-99')
hub_1_hk (FK)	Hash key → first Hub	MD5('CUST-42')
hub_2_hk (FK)	Hash key → second Hub	MD5('PROD-99')
load_date	When first seen	2024-03-10 14:22
record_source	Which system reported	'ORDER_SYSTEM'

CREATE TABLE link_order (
    order_hk       BINARY(16)   NOT NULL,  -- MD5(cust_bk || order_id || prod_bk)
    customer_hk    BINARY(16)   NOT NULL,  -- FK to hub_customer
    product_hk     BINARY(16)   NOT NULL,  -- FK to hub_product
    order_hk_bk    VARCHAR(30)  NOT NULL,  -- Order business key
    load_date      TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL,

    CONSTRAINT pk_link_order PRIMARY KEY (order_hk)
);

Multi-Hub Links

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ hub_customer │     │  hub_product │     │  hub_store   │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       └────────────┬───────┘────────────────────┘
                    │
             ┌──────▼───────┐
             │  link_sale   │  ← Customer bought Product at Store
             └──────────────┘

Handling Relationship Changes

When a customer cancels an order, the Link row stays — it records that the relationship once existed. Changes are tracked in a Satellite on the Link:

-- Satellite on the Link to track relationship status changes
CREATE TABLE sat_order_details (
    order_hk       BINARY(16)     NOT NULL,  -- FK to link_order
    load_date      TIMESTAMP      NOT NULL,
    hash_diff      BINARY(16)     NOT NULL,  -- Hash of descriptive cols
    record_source  VARCHAR(100)   NOT NULL,
    order_status   VARCHAR(20),             -- 'PLACED', 'SHIPPED', 'CANCELLED'
    order_amount   DECIMAL(12,2),
    shipping_method VARCHAR(50),

    CONSTRAINT pk_sat_order PRIMARY KEY (order_hk, load_date)
);

📖 Analogy

Links are like marriage certificates. A certificate records that two people entered a relationship. If they divorce, you don't destroy the certificate — you file a new document (Satellite record) recording the status change. The historical record is preserved.

💡 Interview Insight

"Links are immutable — they record that a relationship existed. Changes (cancellations, amendments) are captured in a Satellite attached to the Link. This preserves the full history: we know when the relationship was formed, when it changed status, and by which source system."

📡 Satellites — The History Keepers

If Hubs answer "WHAT exists?" and Links answer "HOW things relate?", then Satellites answer "WHAT do we know about it, and WHEN did we know it?"

One Satellite Per Source Rule

A critical DV 2.0 rule: one Satellite per source system per Hub or Link.

                    ┌──────────────┐
                    │ hub_customer │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
     ┌────────▼─────┐     │    ┌───────▼────────┐
     │sat_customer  │     │    │ sat_customer   │
     │  _crm        │     │    │   _billing     │
     └──────────────┘     │    └────────────────┘
                          │
                 ┌────────▼─────────┐
                 │ sat_customer     │
                 │   _web_profile   │
                 └──────────────────┘

📊 Different Change Rates

CRM updates monthly, billing updates daily. Mixing them stores redundant copies of unchanged fields on every billing change.

🔍 Source Auditability

"What did the billing system tell us about this customer on March 15?" — answerable without interference from other sources.

⚡ Parallel Loading

Each source loads its own Satellite independently — no contention between ETL streams.

CREATE TABLE sat_customer_crm (
    customer_hk    BINARY(16)    NOT NULL,  -- FK to hub_customer
    load_date      TIMESTAMP     NOT NULL,  -- When this version loaded
    load_end_date  TIMESTAMP     NULL,      -- When next version arrived
    hash_diff      BINARY(16)    NOT NULL,  -- MD5 of all descriptive cols
    record_source  VARCHAR(100)  NOT NULL,
    first_name     VARCHAR(100),
    last_name      VARCHAR(100),
    email          VARCHAR(200),
    phone          VARCHAR(20),
    loyalty_tier   VARCHAR(20),

    CONSTRAINT pk_sat_cust_crm PRIMARY KEY (customer_hk, load_date)
);

Satellites Replace SCD Complexity

Kimball SCD Type 2	Data Vault Satellite
Must decide SCD type per column upfront	All changes captured automatically
Surrogate key changes per version	Hub hash key is stable across versions
is_current flag requires UPDATE	New row inserted; optional load_end_date
Complex ETL: UPDATE + INSERT	INSERT-only (append-only pattern)
Mixed concerns: structure + history	Structure (Hub) / history (Sat) separated

📖 Analogy

Satellites are like medical records. Every doctor visit creates a new entry with current vitals. Nobody erases previous records — they stack chronologically. The hash_diff is like the nurse checking "has anything changed since last visit?"

💡 Interview Insight

"In Kimball, you decide SCD type per attribute upfront. In Data Vault, Satellites capture all history by default as INSERT-only appends. No current flags, no surrogate key management. The Hub's hash key is stable across all versions, and each source has its own Satellite for natural separation."

🔐 Hash Keys — Parallel Loading Engine

Traditional surrogate keys use auto-incrementing sequences — a serial bottleneck. Data Vault 2.0 replaces them with deterministic hash functions:

Source A ──── MD5('CUST-42') ──── INSERT (HK=0x7a3b...)  ← Independent!
Source B ──── MD5('CUST-42') ──── (same hash, skip!)     ← No coordination!
Source C ──── MD5('PROD-99') ──── INSERT (HK=0x4f1c...)  ← Parallel!

⚡ Parallel Loading

Any ETL process computes hash keys independently — no shared state, no locks, no sequence coordination.

🎯 Deterministic Joins

To join Hub to Link or Satellite, you compute the key — no lookups. MD5('CUST-42') always equals MD5('CUST-42').

🌐 Distribution-Friendly

Hash keys distribute evenly across database nodes. Sequential integers cause hot-spotting in BigQuery/Snowflake/Redshift.

🔄 Idempotent Loads

Re-running a load with the same source data produces the same hash keys, making loads safely repeatable.

Computing Hash Keys in SQL

-- Single-column business key
SELECT MD5(customer_email) AS customer_hk
FROM staging_customers;

-- Composite key: always use a consistent delimiter
SELECT MD5(CONCAT_WS('||',
    UPPER(TRIM(chain_code)),
    UPPER(TRIM(store_number))
)) AS store_hk
FROM staging_stores;

-- Link hash key: hash of all participating business keys
SELECT MD5(CONCAT_WS('||',
    UPPER(TRIM(customer_email)),
    UPPER(TRIM(order_id)),
    UPPER(TRIM(product_sku))
)) AS order_hk
FROM staging_orders;

-- hash_diff for Satellites (change detection)
SELECT MD5(CONCAT_WS('||',
    COALESCE(first_name, ''),
    COALESCE(last_name, ''),
    COALESCE(email, ''),
    COALESCE(phone, '')
)) AS hash_diff
FROM staging_crm_customers;

Critical Implementation Rules

Rule	Why
Always UPPER + TRIM before hashing	' John ' and 'john' should produce the same hash
Use a consistent delimiter like '\|\|'	Prevents ('AB','C') and ('A','BC') from colliding
COALESCE NULLs to empty string	MD5(NULL) = NULL, which breaks everything
Choose one hash function, stick with it	MD5 everywhere, or SHA-256 everywhere — never mix
Document your hashing convention	Everyone on the team must hash the same way

MD5 vs SHA-256

Factor	MD5	SHA-256
Speed	Faster	~30-40% slower
Storage	16 bytes	32 bytes
Collision risk	Negligible for DW	Essentially zero
Crypto security	Broken (don't use!)	Secure
DV adoption	Very common	Growing

💡 Interview Insight

"Hash keys enable parallel, distributed loading without coordination. Any ETL process can independently compute the same hash from the same business key. This makes loads idempotent and distributes evenly across nodes, avoiding hot-spotting in distributed architectures like BigQuery or Snowflake."

⚡ PIT Tables & Bridge Tables

Data Vault is optimized for loading, not querying. Reconstructing a complete entity at a point in time requires expensive correlated subqueries across multiple Satellites:

-- The SLOW way: correlated subqueries for point-in-time
SELECT h.customer_bk, crm.first_name, bill.payment_method
FROM hub_customer h
LEFT JOIN sat_customer_crm crm
    ON h.customer_hk = crm.customer_hk
    AND crm.load_date = (
        SELECT MAX(load_date) FROM sat_customer_crm
        WHERE customer_hk = h.customer_hk
        AND load_date <= '2024-03-15'
    )
LEFT JOIN sat_customer_billing bill
    ON h.customer_hk = bill.customer_hk
    AND bill.load_date = (
        SELECT MAX(load_date) FROM sat_customer_billing
        WHERE customer_hk = h.customer_hk
        AND load_date <= '2024-03-15'
    );

Point-in-Time (PIT) Tables

A PIT table pre-computes which Satellite record is effective for each Hub key at each snapshot date:

CREATE TABLE pit_customer (
    customer_hk            BINARY(16)   NOT NULL,
    snapshot_date          DATE          NOT NULL,
    sat_crm_load_date      TIMESTAMP,   -- → sat_customer_crm
    sat_billing_load_date  TIMESTAMP,   -- → sat_customer_billing
    sat_web_load_date      TIMESTAMP,   -- → sat_customer_web

    CONSTRAINT pk_pit_cust PRIMARY KEY (customer_hk, snapshot_date)
);

-- FAST point-in-time query using PIT → pure equi-joins!
SELECT
    h.customer_bk,
    crm.first_name, crm.last_name,
    bill.payment_method, bill.credit_score,
    web.last_login
FROM pit_customer pit
JOIN hub_customer h         ON pit.customer_hk = h.customer_hk
LEFT JOIN sat_customer_crm crm
    ON pit.customer_hk = crm.customer_hk
    AND pit.sat_crm_load_date = crm.load_date
LEFT JOIN sat_customer_billing bill
    ON pit.customer_hk = bill.customer_hk
    AND pit.sat_billing_load_date = bill.load_date
LEFT JOIN sat_customer_web web
    ON pit.customer_hk = web.customer_hk
    AND pit.sat_web_load_date = web.load_date
WHERE pit.snapshot_date = '2024-03-15';

The Architecture Flow

┌────────────────────────────────────────────────────────┐
│                  QUERY FLOW                             │
│                                                        │
│   Raw Vault              Business Vault    Info Mart   │
│   ┌──────────┐          ┌──────────────┐  ┌────────┐  │
│   │ Hubs     │───PIT───▶│ PIT Tables   │─▶│ Star   │  │
│   │ Links    │──Bridge─▶│ Bridge Tables│─▶│ Schema │  │
│   │Satellites│          │ Calc Sats    │  │        │  │
│   └──────────┘          └──────────────┘  └────────┘  │
└────────────────────────────────────────────────────────┘

📖 Analogy

PIT tables are like a library's daily catalog snapshot — instead of searching every book's checkout history, check the snapshot which already lists the state of every book. Bridge tables are like "recommended books" cross-reference cards that pre-compute relationships.

💡 Interview Insight

"Raw Data Vault is optimized for loading, not querying. For performance, we build PIT tables to eliminate correlated subqueries and Bridge tables to pre-walk Link relationships. Both sit in the Business Vault. For end-user consumption, we build star schema Information Marts — giving analysts the familiar Kimball experience backed by Data Vault's auditability."

⚖️ Data Vault vs Kimball

This isn't a religious war — Data Vault and Kimball solve different problems, and the best architectures often use both.

Criterion	Kimball Star Schema	Data Vault 2.0
Primary purpose	Reporting & analytics	Enterprise integration
Optimization	Query performance	Load agility & audit
Sources	Few (1-5 per mart)	Many (10-100+)
Schema changes	Painful ALTER + backfill	Add Satellites, no refactor
Auditability	Limited (transformed)	Full (source + timestamp)
History	SCD Types (complex)	Satellite append (simple)
Query complexity	Simple (star joins)	Complex (needs PIT/Bridge)
End-user access	Excellent (BI-native)	Poor (not for direct query)
Parallel dev	Bottlenecked on dims	Independent team loading

The Layered Architecture: Best of Both Worlds

┌─────────────────────────────────────────────────────────────────┐
│                 ENTERPRISE DATA WAREHOUSE                       │
│                                                                 │
│  ┌─────────────┐    ┌────────────────┐    ┌────────────────┐   │
│  │  STAGING    │    │   RAW VAULT    │    │ BUSINESS VAULT │   │
│  │  LAYER      │───▶│                │───▶│                │   │
│  │             │    │  Hubs          │    │  PIT Tables    │   │
│  │  Source     │    │  Links         │    │  Bridge Tables │   │
│  │  data as-is │    │  Satellites    │    │  Calc Sats     │   │
│  │  temp       │    │  No biz rules  │    │  Biz rules     │   │
│  └─────────────┘    └────────────────┘    └───────┬────────┘   │
│                                                   │             │
│                                                   ▼             │
│                                          ┌────────────────┐    │
│                                          │  INFO MART     │    │
│                                          │ (Star Schemas) │    │
│                                          │                │    │
│                                          │ fact_sales     │    │
│                                          │ dim_customer   │    │
│                                          │ dim_product    │    │
│                                          │                │    │
│                                          │ Kimball-style  │    │
│                                          │ for BI tools   │    │
│                                          └────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Decision Flowchart

                    ┌───────────────────────────┐
                    │ How many source systems?   │
                    └─────────────┬─────────────┘
                                  │
                ┌─────────────────┼─────────────────┐
                │ 1-5 sources     │                  │ 10+ sources
                ▼                 │                  ▼
        ┌───────────────┐        │       ┌──────────────────┐
        │ Requirements  │        │       │ Data Vault +     │
        │ stable?       │        │       │ Star Schema Marts│
        └───────┬───────┘        │       └──────────────────┘
                │                │
      ┌─────────┼─────────┐     │ 5-10 sources
      │ Yes              │ No   ▼
      ▼                  ▼  ┌───────────────┐
┌─────────────┐  ┌──────────│ Need audit    │
│ Kimball     │  │ DV +     │ trail?        │
│ Star Schema │  │ Star     └───────┬───────┘
└─────────────┘  │ Marts   Yes/     │     \No
                 └──────  ▼         ▼
                   DV + Marts    Kimball

📖 Analogy

Data Vault is the vault — the secure, fireproof, auditable repository where every document is stored in its original form. Kimball star schemas are the display cases — beautifully organized for visitors. You wouldn't let visitors rummage through the vault, and you wouldn't store originals in a display case. You need both.

💡 Interview Insight

"It's not either-or. Data Vault excels at the integration layer with many source systems and regulatory needs. Kimball excels at presentation — star schemas are intuitive for BI tools. The pattern: Raw Vault for ingestion, Business Vault for business rules with PIT and Bridge tables, and Kimball-style Information Marts for consumption. Kimball-only for small, stable projects. Add Data Vault when integrating 5-10+ sources or when compliance requires full lineage."

🧪 Quiz: Test Your Knowledge

Q1: What is the primary purpose of a Hub in Data Vault?

Q2: Why does Data Vault 2.0 use hash keys instead of auto-incrementing surrogates?

Q3: A customer's phone number changes. In Data Vault, what happens?

Q4: Why does Data Vault recommend one Satellite per source system per Hub?

Q5: What problem do Point-in-Time (PIT) tables solve?

Q6: In Data Vault architecture, where do business rules get applied?

Q7: Which architecture fits an enterprise with 15+ sources, regulatory audits, and a Tableau BI team?

🎯 Key Takeaways

1. Structure ≠ Content

Hubs & Links define what exists and how things relate (structure). Satellites define what we know (content). This separation = agility.

2. Hubs Are Immutable

Identity registries for business entities. Once registered, never changed. All descriptions and history → Satellites.

3. Links = Many-to-Many

All relationships modeled as M:N from the start. Status changes tracked in Link Satellites. No painful refactoring later.

4. INSERT-Only History

Satellites capture all history as appends. No UPDATEs, no current flags, no SCD complexity. One Satellite per source per Hub/Link.

5. Hash Keys = Parallelism

Deterministic hashing replaces sequences. Independent parallel loading, distributed joins, idempotent ops. Always UPPER, TRIM, COALESCE before hashing.

6. PIT + Bridge = Performance

PIT tables pre-compute temporal lookups. Bridge tables pre-walk Link relationships. Both sit in the Business Vault layer.

7. DV + Kimball = Best of Both

Raw Vault → Business Vault for integration & audit. Kimball star schemas (Info Marts) for BI consumption. Raw Vault never changes.

8. Know When NOT to Use DV

Small projects, few sources, stable requirements → Kimball-only is simpler. DV shines at 10+ sources with regulatory needs.