πŸ… Medallion Architecture β€” The Layered Foundation

The Medallion Architecture solves the data swamp problem by organizing data into three layers of increasing quality: Bronze, Silver, and Gold.

πŸ“₯ Sources β†’ πŸ₯‰ Bronze β†’ πŸ₯ˆ Silver β†’ πŸ₯‡ Gold

Schema-on-read β†’ Schema enforced β†’ Business-optimized
Append-only β†’ Deduplicated β†’ Star schemas / OBTs

πŸ₯‰ Bronze (Raw)

Store as-is from source. Append-only. Add metadata: _ingested_at, _source_file. Your immutable audit trail.
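A minimal Bronze ingest can be sketched as follows (table names are illustrative; `input_file_name()` assumes a Spark-based engine):

```sql
-- Bronze: append source records as-is, stamped with audit metadata
INSERT INTO bronze.raw_orders
SELECT
  *,                                     -- untouched source columns
  current_timestamp() AS _ingested_at,   -- when the row landed
  input_file_name()   AS _source_file    -- which file it came from
FROM staging_orders;
```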

πŸ₯ˆ Silver (Cleaned)

Enforce types, deduplicate, resolve business keys, apply SCD logic. One row per entity per version.

πŸ₯‡ Gold (Business)

Star schemas, OBTs, pre-aggregations. Purpose-built for dashboards, ML features, executive reports.

-- Example folder/table structure
lakehouse/
β”œβ”€β”€ bronze/
β”‚   β”œβ”€β”€ ecommerce/
β”‚   β”‚   β”œβ”€β”€ raw_orders/           -- partitioned by _ingested_date
β”‚   β”‚   └── raw_customers/        -- partitioned by _ingested_date
β”‚   └── erp/
β”‚       └── raw_invoices/
β”œβ”€β”€ silver/
β”‚   β”œβ”€β”€ clean_orders/             -- partitioned by order_date
β”‚   β”œβ”€β”€ clean_customers/          -- no partition (small dim)
β”‚   └── clean_products/           -- no partition (small dim)
└── gold/
    β”œβ”€β”€ daily_sales_summary/      -- partitioned by sale_date
    β”œβ”€β”€ customer_360/             -- one big table
    └── customer_ltv/
πŸ“– Analogy

Medallion is like a water treatment plant. Bronze is raw intake β€” pump in whatever comes from the river (mud, fish, debris). Silver is filtration β€” remove impurities, standardize pH, test for contaminants. Gold is tap water β€” clean, safe, optimized for drinking. You never throw away the raw intake logs because if someone gets sick, you trace back to the source.

πŸ’‘ Interview Insight

"Medallion organizes lakehouse data into three layers of increasing quality. Bronze guarantees completeness, Silver guarantees correctness, and Gold guarantees performance. It works for both batch and streaming because the layer contracts are independent of the execution engine."

πŸ“‹ Schema-on-Read vs Schema-on-Write

The question of when to enforce schema determines where your team spends debugging time, how fast you onboard new sources, and whether consumers trust the data.

πŸ“ Schema-on-Write

  • βœ… Errors caught immediately β€” no garbage in
  • βœ… Consumers trust data β€” schema is a contract
  • βœ… Predictable query performance
  • ⚠️ Rigid β€” ALTER TABLE + ETL changes for new columns
  • ⚠️ Slow onboarding β€” schema mapping first
  • ⚠️ Upstream changes break pipelines instantly

πŸ“– Schema-on-Read

  • βœ… Fast ingestion β€” no upfront schema design
  • βœ… Flexible β€” store anything, decide meaning later
  • βœ… No data loss β€” every record preserved
  • ⚠️ Errors shift downstream to query time
  • ⚠️ Every reader reimplements parsing logic
  • ⚠️ Data swamp risk without discipline

The lakehouse resolves this with a hybrid: schema-on-read at Bronze, schema-on-write at Silver.

-- Bronze: store raw, both API versions coexist peacefully
-- v1: {"customer_id": 12345, "name": "Alice"}
-- v2: {"customer_id": "CUST-12345", "name": "Alice"}

-- Silver: transformation handles both versions
INSERT INTO silver.clean_customers
SELECT
  CASE
    WHEN typeof(raw:customer_id) = 'INTEGER'
      THEN CONCAT('CUST-', CAST(raw:customer_id AS STRING))
    ELSE raw:customer_id::STRING
  END AS customer_id,
  raw:name::STRING AS customer_name
FROM bronze.raw_customers
WHERE raw:customer_id IS NOT NULL;
Scenario                        | Strategy                         | Why
Financial / regulatory data     | Strict schema-on-write           | Compliance requires integrity
Exploratory / new sources       | Schema-on-read at Bronze         | Don't block ingestion
ML feature stores               | Strict at Silver, flex at Bronze | Type consistency for training
Event / clickstream data        | Schema-on-read at Bronze         | High volume, evolving schema
Master data (customer, product) | Strict schema-on-write at Silver | Core entities need SSOT
πŸ“– Analogy

Schema-on-write is airport security β€” check everything before boarding. Schema-on-read is a train station β€” anyone can board. The lakehouse hybrid is TSA PreCheck β€” let everyone into the terminal (Bronze), but validate thoroughly before they board the flight (Silver).

🎯 When to Model at Silver vs Gold

Everyone understands Bronze (dump raw) and Gold (build dashboards). The real craft is deciding how much modeling happens at Silver versus Gold.

πŸ₯ˆ Model at Silver when…

  • πŸ“Œ Multiple Gold tables need the same cleaned entity
  • πŸ“Œ You need a single source of truth for a business key
  • πŸ“Œ Dedup/conforming logic should be applied once
  • πŸ“Œ SCD history tracking is needed at entity level

πŸ₯‡ Model at Gold when…

  • πŸ“Œ Aggregation is specific to one business process
  • πŸ“Œ You need to denormalize for query performance
  • πŸ“Œ Pre-computation saves significant dashboard runtime
  • πŸ“Œ The model represents a specific KPI (CLV, churn)
-- ❌ WRONG: Building star schemas in Silver (premature coupling)
silver.fact_sales          -- Too specific for Silver
silver.dim_customer        -- "dim_" implies Gold-level design
silver.bridge_promotions   -- Way too specific

-- βœ… RIGHT: Entity-centric Silver
silver.clean_sales_transactions   -- One row per transaction
silver.clean_customers            -- One row per customer
silver.clean_products             -- One row per product
-- SILVER: cleaned once, reused everywhere
CREATE TABLE silver.clean_orders AS
SELECT
  order_id,
  customer_id,
  order_date,
  CAST(total_amount AS DECIMAL(12,2)) AS total_amount,
  order_status,
  updated_at
FROM bronze.raw_orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY updated_at DESC
) = 1;

-- GOLD 1: Finance needs daily revenue
CREATE TABLE gold.daily_revenue AS
SELECT
  order_date,
  SUM(total_amount) AS total_revenue
FROM silver.clean_orders
WHERE order_status = 'completed'
GROUP BY order_date;

-- GOLD 2: Marketing needs customer LTV
CREATE TABLE gold.customer_ltv AS
SELECT
  c.customer_id,
  SUM(o.total_amount) AS lifetime_value
FROM silver.clean_customers c
LEFT JOIN silver.clean_orders o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
πŸ’‘ Interview Insight

"Dimensional modeling lives in the Gold layer. Silver tables are entity-centric β€” cleaned and conformed but not shaped for any specific business process. Model at Silver what's universally true about an entity; model at Gold what's specifically useful for a business question."

πŸ“‚ Partitioning Strategy

In a lakehouse, your partition strategy is inseparable from your data model. The wrong partition key turns a sub-second query into a full-scan nightmare; the right one lets partition pruning skip the vast majority of the data.

-- How partitioning works: file-system-level directories
delta_table/
β”œβ”€β”€ order_date=2024-01-01/
β”‚   β”œβ”€β”€ part-00000.parquet   (50 MB)
β”‚   └── part-00001.parquet   (48 MB)
β”œβ”€β”€ order_date=2024-01-02/
β”‚   β”œβ”€β”€ part-00000.parquet   (52 MB)
β”‚   └── part-00001.parquet   (47 MB)

-- Query: WHERE order_date = '2024-01-02'
-- Engine reads ONLY the 2024-01-02 directory (~99 MB)
-- Skips all other directories entirely

πŸ“… By Time (Most Common)

Aligns with how queries filter (by date range) and how data arrives (daily batches, hourly streams).

🏒 By Business Entity

Multi-tenant SaaS or regional data with strict isolation needs. Partition by tenant_id or region.

⚠️ Over-Partitioning

Partitioning by (year, month, day, hour, user_id) creates millions of tiny 1 KB files β€” query planning takes longer than query execution!

πŸ“ Right-Sizing Rule

Each partition should hold at least 128 MB (ideally 256 MB – 1 GB). Too small? Coarsen the key.
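One way to sanity-check file sizes on a Delta table is `DESCRIBE DETAIL` (Delta Lake syntax; the table name is illustrative):

```sql
-- Returns table-level stats including numFiles and sizeInBytes
DESCRIBE DETAIL silver.clean_orders;
-- If sizeInBytes / numFiles is far below 128 MB, the partition key
-- is too fine: coarsen it (e.g. day β†’ month) or compact with OPTIMIZE
```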

-- Delta Lake: partition by date
CREATE TABLE silver.clean_orders
USING DELTA
PARTITIONED BY (order_date)
AS SELECT * FROM transformed_orders;

-- Iceberg: hidden partitioning by month (game-changer!)
CREATE TABLE silver.clean_orders (
  order_id BIGINT,
  order_ts TIMESTAMP,
  amount   DECIMAL(12,2)
)
USING ICEBERG
PARTITIONED BY (months(order_ts));

-- Users query naturally β€” Iceberg prunes automatically
SELECT * FROM silver.clean_orders
WHERE order_ts BETWEEN '2024-01-15' AND '2024-01-20';

-- Partition evolution WITHOUT rewriting data!
ALTER TABLE silver.clean_orders
REPLACE PARTITION FIELD days(order_ts) WITH months(order_ts);
Factor                       | Recommendation                   | Example
Queries filter by date       | Partition by date/month          | PARTITIONED BY (order_date)
Multi-tenant isolation       | Partition by tenant_id           | PARTITIONED BY (tenant_id)
Table < 1 GB total           | Don't partition at all           | Small dimension tables
High-cardinality (user_id)   | Don't partition β€” Z-ORDER        | ZORDER BY (user_id)
Streaming (many small files) | Partition by date + auto-compact | Schedule OPTIMIZE regularly
πŸ“– Analogy

Partitioning is like organizing a library. Partition by genre and you skip entire aisles. But partition by author's last name AND first name AND year AND edition? Millions of shelves, each holding a single pamphlet β€” the librarian spends more time walking than reading. Z-ordering sorts books within a shelf by multiple criteria.

πŸ’Ύ Integration with Storage Formats

Your data model lives inside Delta Lake, Iceberg, or Hudi. These table formats dictate what operations are efficient, which schema changes are safe, and how physical layout affects query performance.

πŸ”„ MERGE for SCD2

Delta/Iceberg make MERGE efficient β€” the foundation for SCD2 in the lakehouse. Z-ORDER by the merge key for minimal file reads.

πŸ“Š Z-ORDER Alignment

Z-ORDER columns should match your model's most common join and filter keys. Fact tables β†’ customer_id, product_id.

πŸ“ˆ Column Statistics

Put filtered columns first in table definitions. Formats collect stats for the first N columns β€” enables file pruning.
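A hedged sketch of this idea in Delta Lake (the table is illustrative; `delta.dataSkippingNumIndexedCols` is the table property that controls how many leading columns get statistics):

```sql
-- Filtered columns first, so they fall inside the stats window
CREATE TABLE silver.clean_events (
  event_date DATE,      -- filtered in nearly every query
  user_id    BIGINT,    -- common filter/join key
  event_type STRING,
  payload    STRING     -- wide, rarely filtered: keep it last
) USING DELTA
TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8');
```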

πŸ”€ Schema Evolution

Iceberg tracks columns by ID, so renames are safe. Delta tracks columns by name by default, so renames are risky unless column mapping is enabled. This matters as Silver schemas evolve over time.

-- SCD Type 2 via MERGE in Delta Lake
MERGE INTO silver.dim_customer AS target
USING (
  SELECT
    customer_id,
    customer_name,
    city,
    current_timestamp() AS effective_from
  FROM bronze.raw_customers_batch
) AS source
ON target.customer_id = source.customer_id
   AND target.is_current = true

-- Expire changed records
WHEN MATCHED AND (
  target.customer_name != source.customer_name
  OR target.city != source.city
) THEN UPDATE SET
  target.is_current = false,
  target.effective_to = source.effective_from

-- Insert new customers
WHEN NOT MATCHED THEN INSERT (
  customer_id, customer_name, city,
  effective_from, effective_to, is_current
) VALUES (
  source.customer_id, source.customer_name, source.city,
  source.effective_from, NULL, true
);
-- Z-ORDER + compaction: align with model grain
OPTIMIZE silver.clean_orders
ZORDER BY (customer_id, order_date);

-- Schema evolution: Iceberg is safer
ALTER TABLE silver.clean_customers
RENAME COLUMN customer_email TO email_address;
-- Iceberg: existing Parquet files NOT rewritten (maps by column ID)
-- Delta: risky without column mapping enabled (tracks by column name by default)
Model Pattern             | Z-ORDER Columns         | Why
Fact table (orders)       | customer_id, product_id | Common join keys for star queries
Event table (clickstream) | user_id, session_id     | Filter/group by user then session
SCD2 dimension            | business_key            | MERGE finds all versions quickly
Aggregated Gold table     | No Z-ORDER needed       | Already small and pre-filtered
OBT (One Big Table)       | Top 2-3 filter columns  | Most impactful for wide scans
πŸ’‘ Interview Insight

"The table format dictates what's cheap. Delta and Iceberg make MERGE efficient β€” that's why SCD2 works well in the lakehouse. I align Z-ORDER with primary join/filter keys, schedule compaction after MERGE jobs targeting 128-256 MB files, and prefer Iceberg's hidden partitioning to decouple physical layout from queries. The partition key, Z-ORDER columns, and model grain must tell a consistent story."

πŸ”₯ Real-World Problems & Solutions

Every lakehouse hits the same failure modes. These four problems distinguish engineers who've operated production systems from those who've only read the docs.

🏚️ Problem 1: Silver Becomes a Swamp

Duplicate rows, no clear grain, multiple teams building overlapping Silver tables. Fix: grain contracts, PK constraints, post-load verification, single ownership.

🐌 Problem 2: Gold Tables Too Slow

Gold is just VIEWs on Silver β€” recomputed every query. Fix: materialize Gold tables, partition-align with dashboards, incremental MERGE.

πŸ’₯ Problem 3: Schema Drift Breaks Pipes

Source adds columns or changes types. Fix: schema drift detection before writes, auto-evolve non-breaking changes, quarantine breaking ones.
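A quarantine step for a breaking type change might look like this (tables are illustrative; `TRY_CAST` assumes Spark SQL 3.x):

```sql
-- Route rows that violate the Silver type contract to quarantine
INSERT INTO silver.quarantine_customers
SELECT raw,
       current_timestamp() AS _quarantined_at,
       'customer_id not castable to BIGINT' AS _reason
FROM bronze.raw_customers
WHERE raw:customer_id IS NOT NULL
  AND TRY_CAST(raw:customer_id AS BIGINT) IS NULL;

-- Only conforming rows flow on to Silver
INSERT INTO silver.clean_customers
SELECT TRY_CAST(raw:customer_id AS BIGINT) AS customer_id,
       raw:name::STRING AS customer_name
FROM bronze.raw_customers
WHERE TRY_CAST(raw:customer_id AS BIGINT) IS NOT NULL;
```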

πŸ“¬ Problem 4: Late-Arriving Data

January 5th order arrives January 20th. Gold already computed. Fix: reprocessing windows β€” MERGE last N days at Gold on every run.

-- Enforce grain: post-load verification (MUST return 0 rows)
SELECT order_id, COUNT(*) AS row_count
FROM silver.clean_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Materialize Gold instead of VIEWs
CREATE TABLE gold.daily_sales_summary
USING DELTA
PARTITIONED BY (order_date)
AS
SELECT
  o.order_date,
  s.store_name,
  SUM(o.total_amount) AS total_revenue,
  COUNT(DISTINCT o.order_id) AS order_count
FROM silver.clean_orders o
JOIN silver.clean_stores s
  ON o.store_id = s.store_id
WHERE o.order_status = 'completed'
GROUP BY o.order_date, s.store_name;
-- Late-arriving data: reprocess last 7 days every run
MERGE INTO gold.daily_sales_summary AS target
USING (
  SELECT
    o.order_date,
    s.store_name,
    SUM(o.total_amount) AS total_revenue,
    COUNT(DISTINCT o.order_id) AS order_count
  FROM silver.clean_orders o
  JOIN silver.clean_stores s
    ON o.store_id = s.store_id
  WHERE o.order_date >= current_date() - INTERVAL 7 DAYS
    AND o.order_status = 'completed'
  GROUP BY o.order_date, s.store_name
) AS source
ON target.order_date = source.order_date
   AND target.store_name = source.store_name
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
πŸ“– Analogy

Late-arriving data is like a postcard from vacation that arrives three weeks after you're home. Your photo album (Gold) already has a "January 5th" page. You don't throw away the album β€” you open to that page and add the postcard. The reprocessing window is checking your mailbox for late postcards every day for a week.

Quiz: Lakehouse Modeling Mastery

Q1: In Medallion Architecture, which layer should enforce schema and data types?

Q2: Silver table clean_orders has duplicates β€” same order_id appears 3 times. Most likely cause?

Q3: 200 GB Silver table, queries filter by last 30 days, frequent joins on customer_id. Optimal layout?

Q4: Source changes customer_id from INT to STRING. Delta has schema enforcement enabled. What happens?

Q5: Gold table daily_sales_summary is a VIEW. Dashboard queries take 45 seconds. Primary fix?

Q6: 50M customer SCD2 table, ~100K changes daily. Most efficient approach?

Q7: Iceberg table partitioned by days(order_ts). Query uses WHERE order_ts BETWEEN '2024-01-15 08:00' AND '2024-01-15 17:00'. Does partition pruning occur?


πŸ† Course Complete β€” Key Takeaways

You've mastered all 9 modules. Here's your lakehouse modeling cheat sheet β€” and a summary of the entire course.

πŸ… Medallion Architecture

Bronze = completeness (raw, append-only). Silver = correctness (cleaned, deduplicated). Gold = performance (star schemas, OBTs). Each layer has a clear contract.

πŸ“‹ Schema Strategy

Schema-on-read at Bronze, schema-on-write at Silver. Detect drift before it breaks pipes. Auto-evolve non-breaking changes; quarantine breaking ones.

🎯 Silver vs Gold

Silver = entity-centric (universally true). Gold = business-process-centric (specifically useful). Anti-pattern: star schemas at Silver = premature coupling.

πŸ“‚ Partitioning

Low-cardinality WHERE columns. 128 MB–1 GB per partition. Z-ORDER for high-cardinality joins. Iceberg hidden partitioning decouples layout from queries.

πŸ’Ύ Storage Formats

MERGE = lakehouse SCD2. Align Z-ORDER with merge keys. Iceberg tracks columns by ID (safe renames). Compact to 128-256 MB files after writes.

πŸ”₯ Production Survival

Grain contracts prevent swamps. Materialize Gold (no VIEWs). Reprocessing windows for late data. Schema drift detection saves 3 AM pages.

πŸŽ“ The Big Picture

"The lakehouse doesn't eliminate modeling β€” it changes where and when you model. Bronze defers decisions, Silver makes them, and Gold serves them. Get the boundaries right and everything downstream flows."

πŸ“˜ Modules 1-3

Foundations: what modeling is, facts/dimensions/grain, star vs snowflake schemas.

πŸ“— Modules 4-6

Core skills: SCDs, advanced patterns (bridge/junk/role-playing), interview gauntlet.

πŸ“• Modules 7-9

Modern mastery: Data Vault, activity schema & OBTs, and lakehouse modeling.