Append-only → Deduplicated → Star schemas / OBTs
🥉 Bronze (Raw)
Store as-is from source. Append-only. Add metadata: _ingested_at, _source_file. Your immutable audit trail.
🥈 Silver (Cleaned)
Enforce types, deduplicate, resolve business keys, apply SCD logic. One row per entity per version.
🥇 Gold (Business)
Star schemas, OBTs, pre-aggregations. Purpose-built for dashboards, ML features, executive reports.
-- Example folder/table structure
lakehouse/
├── bronze/
│   ├── ecommerce/
│   │   ├── raw_orders/      -- partitioned by _ingested_date
│   │   └── raw_customers/   -- partitioned by _ingested_date
│   └── erp/
│       └── raw_invoices/
├── silver/
│   ├── clean_orders/        -- partitioned by order_date
│   ├── clean_customers/     -- no partition (small dim)
│   └── clean_products/      -- no partition (small dim)
└── gold/
    ├── daily_sales_summary/ -- partitioned by sale_date
    ├── customer_360/        -- one big table
    └── customer_ltv/
📖 Analogy
Medallion is like a water treatment plant. Bronze is raw intake: pump in whatever comes from the river (mud, fish, debris). Silver is filtration: remove impurities, standardize pH, test for contaminants. Gold is tap water: clean, safe, optimized for drinking. You never throw away the raw intake logs, because if someone gets sick you trace back to the source.
💡 Interview Insight
"Medallion organizes lakehouse data into three layers of increasing quality. Bronze guarantees completeness, Silver guarantees correctness, and Gold guarantees performance. It works for both batch and streaming because the layer contracts are independent of the execution engine."
📋 Schema-on-Read vs Schema-on-Write
The question of when to enforce schema determines where your team spends debugging time, how fast you onboard new sources, and whether consumers trust the data.
🔒 Schema-on-Write
✅ Errors caught immediately: no garbage in
✅ Consumers trust data: schema is a contract
✅ Predictable query performance
⚠️ Rigid: ALTER TABLE + ETL changes for new columns
⚠️ Slow onboarding: schema mapping first
⚠️ Upstream changes break pipelines instantly
🔓 Schema-on-Read
✅ Fast ingestion: no upfront schema design
✅ Flexible: store anything, decide meaning later
✅ No data loss: every record preserved
⚠️ Errors shift downstream to query time
⚠️ Every reader reimplements parsing logic
⚠️ Data swamp risk without discipline
The lakehouse resolves this with a hybrid: schema-on-read at Bronze, schema-on-write at Silver.
-- Bronze: store raw, both API versions coexist peacefully
-- v1: {"customer_id": 12345, "name": "Alice"}
-- v2: {"customer_id": "CUST-12345", "name": "Alice"}

-- Silver: transformation handles both versions
INSERT INTO silver.clean_customers
SELECT
  CASE
    WHEN typeof(raw:customer_id) = 'INTEGER'
      THEN CONCAT('CUST-', CAST(raw:customer_id AS STRING))
    ELSE raw:customer_id::STRING
  END AS customer_id,
  raw:name::STRING AS customer_name
FROM bronze.raw_customers
WHERE raw:customer_id IS NOT NULL;
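The same version-tolerant normalization can be sketched in plain Python; this is an illustrative stand-in for the SQL above, and `normalize_customer` plus the field names are assumptions taken from the example records:

```python
def normalize_customer(raw):
    """Normalize a raw customer record from either API version:
    v1 sends customer_id as an integer, v2 as a 'CUST-' string."""
    cid = raw.get("customer_id")
    if cid is None:
        return None  # mirrors the WHERE raw:customer_id IS NOT NULL filter
    if isinstance(cid, int):
        cid = "CUST-%d" % cid  # upgrade v1 records to the v2 format
    return {"customer_id": str(cid), "customer_name": str(raw["name"])}
```

Both a v1 and a v2 record converge on the same Silver row shape, which is the whole point of deferring the schema decision to the Bronze-to-Silver hop.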
| Scenario | Strategy | Why |
| --- | --- | --- |
| Financial / regulatory data | Strict schema-on-write | Compliance requires integrity |
| Exploratory / new sources | Schema-on-read at Bronze | Don't block ingestion |
| ML feature stores | Strict at Silver, flex at Bronze | Type consistency for training |
| Event / clickstream data | Schema-on-read at Bronze | High volume, evolving schema |
| Master data (customer, product) | Strict schema-on-write at Silver | Core entities need SSOT |
📖 Analogy
Schema-on-write is airport security: check everything before boarding. Schema-on-read is a train station: anyone can board. The lakehouse hybrid is TSA PreCheck: let everyone into the terminal (Bronze), but validate thoroughly before they board the flight (Silver).
🎯 When to Model at Silver vs Gold
Everyone understands Bronze (dump raw) and Gold (build dashboards). The real craft is deciding how much modeling happens at Silver versus Gold.
🥈 Model at Silver when…
• Multiple Gold tables need the same cleaned entity
• You need a single source of truth for a business key
• Dedup/conforming logic should be applied once
• SCD history tracking is needed at entity level
🥇 Model at Gold when…
• Aggregation is specific to one business process
• You need to denormalize for query performance
• The model represents a specific KPI (CLV, churn)
-- ❌ WRONG: Building star schemas in Silver (premature coupling)
silver.fact_sales         -- Too specific for Silver
silver.dim_customer       -- "dim_" implies Gold-level design
silver.bridge_promotions  -- Way too specific

-- ✅ RIGHT: Entity-centric Silver
silver.clean_sales_transactions  -- One row per transaction
silver.clean_customers           -- One row per customer
silver.clean_products            -- One row per product
-- SILVER: cleaned once, reused everywhere
CREATE TABLE silver.clean_orders AS
SELECT order_id, customer_id, order_date,
       CAST(total_amount AS DECIMAL(12,2)) AS total_amount,
       order_status, updated_at
FROM bronze.raw_orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id ORDER BY updated_at DESC
) = 1;
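The QUALIFY pattern keeps only the most recent row per order_id. A minimal in-memory sketch of the same idea, with a hypothetical `dedupe_latest` helper over plain dict rows:

```python
def dedupe_latest(rows, key="order_id", order_by="updated_at"):
    """Keep only the latest row per key, like
    ROW_NUMBER() OVER (PARTITION BY key ORDER BY order_by DESC) = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        # Keep the row with the greatest order_by value per key.
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())
```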
-- GOLD 1: Finance needs daily revenue
CREATE TABLE gold.daily_revenue AS
SELECT order_date, SUM(total_amount) AS total_revenue
FROM silver.clean_orders
WHERE order_status = 'completed'
GROUP BY order_date;
-- GOLD 2: Marketing needs customer LTV
CREATE TABLE gold.customer_ltv AS
SELECT c.customer_id, SUM(o.total_amount) AS lifetime_value
FROM silver.clean_customers c
LEFT JOIN silver.clean_orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
💡 Interview Insight
"Dimensional modeling lives in the Gold layer. Silver tables are entity-centric: cleaned and conformed, but not shaped for any specific business process. Model at Silver what's universally true about an entity; model at Gold what's specifically useful for a business question."
📂 Partitioning Strategy
In a lakehouse, your partition strategy is inseparable from your data model. The wrong partition key turns a sub-second query into a full-scan nightmare. The right one eliminates 99% of data scanned via partition pruning.
-- How partitioning works: file-system-level directories
delta_table/
├── order_date=2024-01-01/
│   ├── part-00000.parquet (50 MB)
│   └── part-00001.parquet (48 MB)
└── order_date=2024-01-02/
    ├── part-00000.parquet (52 MB)
    └── part-00001.parquet (47 MB)

-- Query: WHERE order_date = '2024-01-02'
-- Engine reads ONLY the 2024-01-02 directory (~99 MB)
-- Skips all other directories entirely
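Partition pruning amounts to skipping every directory whose name doesn't match the filter. A toy Python sketch, with a hypothetical layout mirroring the tree above:

```python
# Map of partition directory -> data files, mirroring the layout above.
partitions = {
    "order_date=2024-01-01": ["part-00000.parquet", "part-00001.parquet"],
    "order_date=2024-01-02": ["part-00000.parquet", "part-00001.parquet"],
    "order_date=2024-01-03": ["part-00000.parquet"],
}

def prune(partitions, column, value):
    """Return only files in the matching partition directory;
    the engine never opens the other directories at all."""
    target = "%s=%s" % (column, value)
    return [d + "/" + f
            for d, files in partitions.items() if d == target
            for f in files]
```

For an equality filter on the partition key, the engine's work scales with one directory, not the whole table.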
📅 By Time (Most Common)
Aligns with how queries filter (by date range) and how data arrives (daily batches, hourly streams).
🏢 By Business Entity
Multi-tenant SaaS or regional data with strict isolation needs. Partition by tenant_id or region.
⚠️ Over-Partitioning
Partitioning by (year, month, day, hour, user_id) creates millions of tiny 1 KB files, and query planning ends up taking longer than query execution!
📏 Right-Sizing Rule
Each partition should hold at least 128 MB (ideally 256 MB–1 GB). Too small? Coarsen the key.
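A quick back-of-the-envelope check against the right-sizing rule; the 200 GB table and 3-year span are assumptions for illustration:

```python
def avg_partition_mb(total_gb, num_partitions):
    """Average partition size in MB for a candidate partition key."""
    return total_gb * 1024.0 / num_partitions

# Hypothetical 200 GB orders table spanning 3 years:
daily   = avg_partition_mb(200, 3 * 365)       # ~187 MB per day: healthy
hourly  = avg_partition_mb(200, 3 * 365 * 24)  # ~8 MB per hour: too fine, coarsen
monthly = avg_partition_mb(200, 3 * 12)        # ~5.7 GB per month: coarse
```

Daily lands inside the 128 MB–1 GB window; hourly would scatter the table into thousands of undersized partitions.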
-- Delta Lake: partition by date
CREATE TABLE silver.clean_orders
USING DELTA
PARTITIONED BY (order_date)
AS SELECT * FROM transformed_orders;
-- Iceberg: hidden partitioning by month (game-changer!)
CREATE TABLE silver.clean_orders (
order_id BIGINT, order_ts TIMESTAMP, amount DECIMAL(12,2)
) USING ICEBERG
PARTITIONED BY (months(order_ts));
-- Users query naturally; Iceberg prunes automatically
SELECT * FROM silver.clean_orders
WHERE order_ts BETWEEN '2024-01-15' AND '2024-01-20';
-- Partition evolution WITHOUT rewriting data!
ALTER TABLE silver.clean_orders
REPLACE PARTITION FIELD days(order_ts) WITH months(order_ts);
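Iceberg's months() transform derives the partition value from the timestamp itself, roughly the number of months since the Unix epoch. A sketch of why both bounds of the query above land in one partition (illustrative approximation of the transform):

```python
from datetime import datetime

def months_transform(ts):
    """Approximation of Iceberg's months() partition transform:
    whole months elapsed since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

# Both bounds of the BETWEEN query fall in the same month,
# so the scan is pruned to a single month partition.
lo = months_transform(datetime(2024, 1, 15))
hi = months_transform(datetime(2024, 1, 20))
```

Because the partition value is computed from order_ts, users never reference a separate partition column, which is what "hidden partitioning" means.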
| Factor | Recommendation | Example |
| --- | --- | --- |
| Queries filter by date | Partition by date/month | PARTITIONED BY (order_date) |
| Multi-tenant isolation | Partition by tenant_id | PARTITIONED BY (tenant_id) |
| Table < 1 GB total | Don't partition at all | Small dimension tables |
| High-cardinality (user_id) | Don't partition → Z-ORDER | Z-ORDER BY (user_id) |
| Streaming (many small files) | Partition by date + auto-compact | Schedule OPTIMIZE regularly |
📖 Analogy
Partitioning is like organizing a library. Partition by genre and you skip entire aisles. But partition by author's last name AND first name AND year AND edition? Millions of shelves, each holding a single pamphlet, and the librarian spends more time walking than reading. Z-ordering sorts books within a shelf by multiple criteria.
💾 Integration with Storage Formats
Your data model lives inside Delta Lake, Iceberg, or Hudi. These table formats dictate what operations are efficient, which schema changes are safe, and how physical layout affects query performance.
🔁 MERGE for SCD2
Delta and Iceberg make MERGE efficient, the foundation for SCD2 in the lakehouse. Z-ORDER by the merge key for minimal file reads.
📐 Z-ORDER Alignment
Z-ORDER columns should match your model's most common join and filter keys. Fact tables → customer_id, product_id.
📊 Column Statistics
Put filtered columns first in table definitions. Formats collect stats for the first N columns, enabling file pruning.
🧬 Schema Evolution
Iceberg tracks columns by ID (renames safe). Delta tracks by name (renames risky). This matters as Silver evolves over time.
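File-level pruning from those min/max column statistics works roughly as follows; the file names and value ranges are hypothetical:

```python
def can_skip_file(file_stats, column, value):
    """Skip a file when the filter value cannot fall inside
    its recorded [min, max] range for that column."""
    lo, hi = file_stats[column]
    return value < lo or value > hi

# Hypothetical per-file statistics, as a table format would record them.
files = {
    "part-0001.parquet": {"order_date": ("2024-01-01", "2024-01-10")},
    "part-0002.parquet": {"order_date": ("2024-01-11", "2024-01-20")},
}

# WHERE order_date = '2024-01-15' -> only part-0002 is opened.
to_read = [f for f, s in files.items()
           if not can_skip_file(s, "order_date", "2024-01-15")]
```

This is why sorting (or Z-ordering) by the filter column matters: tight min/max ranges per file mean more files can be skipped.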
-- SCD Type 2 via MERGE in Delta Lake
MERGE INTO silver.dim_customer AS target
USING (
  SELECT customer_id, customer_name, city,
         current_timestamp() AS effective_from
  FROM bronze.raw_customers_batch
) AS source
ON target.customer_id = source.customer_id
AND target.is_current = true
-- Expire changed records
WHEN MATCHED AND (
  target.customer_name != source.customer_name OR
  target.city != source.city
) THEN UPDATE SET
  target.is_current = false,
  target.effective_to = source.effective_from
-- Insert new customers
WHEN NOT MATCHED THEN INSERT (
  customer_id, customer_name, city,
  effective_from, effective_to, is_current
) VALUES (
  source.customer_id, source.customer_name, source.city,
  source.effective_from, NULL, true
);
-- Note: this single pass expires changed rows but does not insert
-- their new versions; a second pass (or staging the source with
-- NULL merge keys) adds the fresh current row for changed customers.
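The SCD2 expire-then-insert flow can be sketched on in-memory rows; this illustrative `scd2_merge` helper also inserts the new version of a changed customer, which the single MERGE above leaves to a second pass:

```python
def scd2_merge(dim, batch, now):
    """SCD Type 2 upsert: expire the current row when a tracked
    attribute changes, then insert the new version as current."""
    current = {r["customer_id"]: r for r in dim if r["is_current"]}
    for src in batch:
        tgt = current.get(src["customer_id"])
        if tgt is not None and (tgt["customer_name"] != src["customer_name"]
                                or tgt["city"] != src["city"]):
            tgt["is_current"] = False   # expire the old version
            tgt["effective_to"] = now
            tgt = None                  # fall through to the insert
        if tgt is None:                 # brand-new or changed customer
            dim.append({**src, "effective_from": now,
                        "effective_to": None, "is_current": True})
    return dim
```

Unchanged matched rows are left alone, so rerunning the same batch is idempotent.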
-- Z-ORDER + compaction: align with model grain
OPTIMIZE silver.clean_orders
ZORDER BY (customer_id, order_date);
-- Schema evolution: Iceberg is safer
ALTER TABLE silver.clean_customers
RENAME COLUMN customer_email TO email_address;
-- Iceberg: existing Parquet files NOT rewritten (maps by column ID)
-- Delta: would break references (tracks by column name)
| Model Pattern | Z-ORDER Columns | Why |
| --- | --- | --- |
| Fact table (orders) | customer_id, product_id | Common join keys for star queries |
| Event table (clickstream) | user_id, session_id | Filter/group by user then session |
| SCD2 dimension | business_key | MERGE finds all versions quickly |
| Aggregated Gold table | No Z-ORDER needed | Already small and pre-filtered |
| OBT (One Big Table) | Top 2-3 filter columns | Most impactful for wide scans |
💡 Interview Insight
"The table format dictates what's cheap. Delta and Iceberg make MERGE efficient; that's why SCD2 works well in the lakehouse. I align Z-ORDER with primary join/filter keys, schedule compaction after MERGE jobs targeting 128-256 MB files, and prefer Iceberg's hidden partitioning to decouple physical layout from queries. The partition key, Z-ORDER columns, and model grain must tell a consistent story."
🔥 Real-World Problems & Solutions
Every lakehouse hits the same failure modes. These four problems distinguish engineers who've operated production systems from those who've only read the docs.
🗑️ Problem 1: Silver Becomes a Swamp
Duplicate rows, no clear grain, multiple teams building overlapping Silver tables. Fix: grain contracts, PK constraints, post-load verification, single ownership.
🐌 Problem 2: Gold Tables Too Slow
Gold is just VIEWs on Silver, recomputed on every query. Fix: materialize Gold tables, partition-align with dashboards, incremental MERGE.
💥 Problem 3: Schema Drift Breaks Pipes
Source adds columns or changes types. Fix: schema drift detection before writes, auto-evolve non-breaking changes, quarantine breaking ones.
📬 Problem 4: Late-Arriving Data
A January 5th order arrives January 20th, after Gold has already been computed. Fix: reprocessing windows, i.e. MERGE the last N days at Gold on every run.
-- Enforce grain: post-load verification (MUST return 0 rows)
SELECT order_id, COUNT(*) AS row_count
FROM silver.clean_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Materialize Gold instead of VIEWs
CREATE TABLE gold.daily_sales_summary
USING DELTA PARTITIONED BY (order_date)
AS SELECT
o.order_date, s.store_name,
SUM(o.total_amount) AS total_revenue,
COUNT(DISTINCT o.order_id) AS order_count
FROM silver.clean_orders o
JOIN silver.clean_stores s ON o.store_id = s.store_id
WHERE o.order_status = 'completed'
GROUP BY o.order_date, s.store_name;
-- Late-arriving data: reprocess last 7 days every run
MERGE INTO gold.daily_sales_summary AS target
USING (
SELECT order_date, store_id,
SUM(total_amount) AS total_revenue,
COUNT(DISTINCT order_id) AS order_count
FROM silver.clean_orders
WHERE order_date >= current_date() - INTERVAL 7 DAYS
GROUP BY order_date, store_id
) AS source
ON target.order_date = source.order_date
AND target.store_id = source.store_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
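Problem 3's drift detection can be sketched as a schema diff run before each Bronze-to-Silver write; the `EXPECTED` schema and type names here are illustrative assumptions:

```python
# Expected Silver schema: column name -> logical type (illustrative).
EXPECTED = {"customer_id": "string", "name": "string", "city": "string"}

def classify_drift(incoming):
    """Compare an incoming batch schema to the expected one.
    New columns are non-breaking (auto-evolve); type changes or
    dropped columns are breaking (quarantine the batch)."""
    added = set(incoming) - set(EXPECTED)
    dropped = set(EXPECTED) - set(incoming)
    changed = {c for c in set(incoming) & set(EXPECTED)
               if incoming[c] != EXPECTED[c]}
    if dropped or changed:
        return "quarantine"
    return "evolve" if added else "ok"
```

Routing breaking batches to a quarantine table keeps the pipeline running while someone reviews the change, instead of failing the whole load at 3 AM.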
📖 Analogy
Late-arriving data is like a postcard from vacation that arrives three weeks after you're home. Your photo album (Gold) already has a "January 5th" page. You don't throw away the album; you open to that page and add the postcard. The reprocessing window is checking your mailbox for late postcards every day for a week.
Quiz: Lakehouse Modeling Mastery
Q1: In Medallion Architecture, which layer should enforce schema and data types?
Q2: Silver table clean_orders has duplicates β same order_id appears 3 times. Most likely cause?
Q3: 200 GB Silver table, queries filter by last 30 days, frequent joins on customer_id. Optimal layout?
Q4: Source changes customer_id from INT to STRING. Delta has schema enforcement enabled. What happens?
Q5: Gold table daily_sales_summary is a VIEW. Dashboard queries take 45 seconds. Primary fix?
Q7: Iceberg table partitioned by days(order_ts). Query uses WHERE order_ts BETWEEN '2024-01-15 08:00' AND '2024-01-15 17:00'. Does partition pruning occur?
🎓 Course Complete: Key Takeaways
You've mastered all 9 modules. Here's your lakehouse modeling cheat sheet and a summary of the entire course.
🏅 Medallion Architecture
Bronze = completeness (raw, append-only). Silver = correctness (cleaned, deduplicated). Gold = performance (star schemas, OBTs). Each layer has a clear contract.
📋 Schema Strategy
Schema-on-read at Bronze, schema-on-write at Silver. Detect drift before it breaks pipes. Auto-evolve non-breaking changes; quarantine breaking ones.
🎯 Silver vs Gold
Silver = entity-centric (universally true). Gold = business-process-centric (specifically useful). Anti-pattern: star schemas at Silver = premature coupling.
📂 Partitioning
Low-cardinality WHERE columns. 128 MB–1 GB per partition. Z-ORDER for high-cardinality joins. Iceberg hidden partitioning decouples layout from queries.
💾 Storage Formats
MERGE = lakehouse SCD2. Align Z-ORDER with merge keys. Iceberg tracks columns by ID (safe renames). Compact to 128-256 MB files after writes.
🔥 Production Survival
Grain contracts prevent swamps. Materialize Gold (no VIEWs). Reprocessing windows for late data. Schema drift detection saves 3 AM pages.
🌐 The Big Picture
"The lakehouse doesn't eliminate modeling; it changes where and when you model. Bronze defers decisions, Silver makes them, and Gold serves them. Get the boundaries right and everything downstream flows."
📚 Modules 1-3
Foundations: what modeling is, facts/dimensions/grain, star vs snowflake schemas.