πŸ… Medallion Architecture β€” The Layered Foundation

The Medallion Architecture solves the data swamp problem by organizing data into three layers of increasing quality: Bronze, Silver, and Gold.

πŸ“₯ Sources β†’ πŸ₯‰ Bronze β†’ πŸ₯ˆ Silver β†’ πŸ₯‡ Gold

Schema-on-read β†’ Schema enforced β†’ Business-optimized
Append-only β†’ Deduplicated β†’ Star schemas / OBTs

πŸ₯‰ Bronze (Raw)

Store as-is from source. Append-only. Add metadata: _ingested_at, _source_file. Your immutable audit trail.
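A minimal Bronze ingest can be sketched as follows (table names are illustrative; `input_file_name()` assumes a Spark-based engine):

```sql
-- Bronze: append source records as-is, stamped with audit metadata
INSERT INTO bronze.raw_orders
SELECT
  *,                                     -- untouched source columns
  current_timestamp() AS _ingested_at,   -- when the row landed
  input_file_name()   AS _source_file    -- which file it came from
FROM staging_orders;
```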

πŸ₯ˆ Silver (Cleaned)

Enforce types, deduplicate, resolve business keys, apply SCD logic. One row per entity per version.

πŸ₯‡ Gold (Business)

Star schemas, OBTs, pre-aggregations. Purpose-built for dashboards, ML features, executive reports.

-- Example folder/table structure
lakehouse/
β”œβ”€β”€ bronze/
β”‚   β”œβ”€β”€ ecommerce/
β”‚   β”‚   β”œβ”€β”€ raw_orders/           -- partitioned by _ingested_date
β”‚   β”‚   └── raw_customers/        -- partitioned by _ingested_date
β”‚   └── erp/
β”‚       └── raw_invoices/
β”œβ”€β”€ silver/
β”‚   β”œβ”€β”€ clean_orders/             -- partitioned by order_date
β”‚   β”œβ”€β”€ clean_customers/          -- no partition (small dim)
β”‚   └── clean_products/           -- no partition (small dim)
└── gold/
    β”œβ”€β”€ daily_sales_summary/      -- partitioned by sale_date
    β”œβ”€β”€ customer_360/             -- one big table
    └── customer_ltv/
πŸ“– Analogy

Medallion is like a water treatment plant. Bronze is raw intake β€” pump in whatever comes from the river (mud, fish, debris). Silver is filtration β€” remove impurities, standardize pH, test for contaminants. Gold is tap water β€” clean, safe, optimized for drinking. You never throw away the raw intake logs because if someone gets sick, you trace back to the source.

πŸ’‘ Interview Insight

"Medallion organizes lakehouse data into three layers of increasing quality. Bronze guarantees completeness, Silver guarantees correctness, and Gold guarantees performance. It works for both batch and streaming because the layer contracts are independent of the execution engine."

πŸ“‹ Schema-on-Read vs Schema-on-Write

The question of when to enforce schema determines where your team spends debugging time, how fast you onboard new sources, and whether consumers trust the data.

πŸ“ Schema-on-Write

  • βœ… Errors caught immediately β€” no garbage in
  • βœ… Consumers trust data β€” schema is a contract
  • βœ… Predictable query performance
  • ⚠️ Rigid β€” ALTER TABLE + ETL changes for new columns
  • ⚠️ Slow onboarding β€” schema mapping first
  • ⚠️ Upstream changes break pipelines instantly

πŸ“– Schema-on-Read

  • βœ… Fast ingestion β€” no upfront schema design
  • βœ… Flexible β€” store anything, decide meaning later
  • βœ… No data loss β€” every record preserved
  • ⚠️ Errors shift downstream to query time
  • ⚠️ Every reader reimplements parsing logic
  • ⚠️ Data swamp risk without discipline

The lakehouse resolves this with a hybrid: schema-on-read at Bronze, schema-on-write at Silver.

-- Bronze: store raw, both API versions coexist peacefully
-- v1: {"customer_id": 12345, "name": "Alice"}
-- v2: {"customer_id": "CUST-12345", "name": "Alice"}

-- Silver: transformation handles both versions
INSERT INTO silver.clean_customers
SELECT
  CASE
    WHEN typeof(raw:customer_id) = 'INTEGER'
      THEN CONCAT('CUST-', CAST(raw:customer_id AS STRING))
    ELSE raw:customer_id::STRING
  END AS customer_id,
  raw:name::STRING AS customer_name
FROM bronze.raw_customers
WHERE raw:customer_id IS NOT NULL;
Scenario                        | Strategy                         | Why
Financial / regulatory data     | Strict schema-on-write           | Compliance requires integrity
Exploratory / new sources       | Schema-on-read at Bronze         | Don't block ingestion
ML feature stores               | Strict at Silver, flex at Bronze | Type consistency for training
Event / clickstream data        | Schema-on-read at Bronze         | High volume, evolving schema
Master data (customer, product) | Strict schema-on-write at Silver | Core entities need SSOT
πŸ“– Analogy

Schema-on-write is airport security β€” check everything before boarding. Schema-on-read is a train station β€” anyone can board. The lakehouse hybrid is TSA PreCheck β€” let everyone into the terminal (Bronze), but validate thoroughly before they board the flight (Silver).

🎯 When to Model at Silver vs Gold

Everyone understands Bronze (dump raw) and Gold (build dashboards). The real craft is deciding how much modeling happens at Silver versus Gold.

πŸ₯ˆ Model at Silver when…

  • πŸ“Œ Multiple Gold tables need the same cleaned entity
  • πŸ“Œ You need a single source of truth for a business key
  • πŸ“Œ Dedup/conforming logic should be applied once
  • πŸ“Œ SCD history tracking is needed at entity level

πŸ₯‡ Model at Gold when…

  • πŸ“Œ Aggregation is specific to one business process
  • πŸ“Œ You need to denormalize for query performance
  • πŸ“Œ Pre-computation saves significant dashboard runtime
  • πŸ“Œ The model represents a specific KPI (CLV, churn)
-- ❌ WRONG: Building star schemas in Silver (premature coupling)
silver.fact_sales          -- Too specific for Silver
silver.dim_customer        -- "dim_" implies Gold-level design
silver.bridge_promotions   -- Way too specific

-- βœ… RIGHT: Entity-centric Silver
silver.clean_sales_transactions   -- One row per transaction
silver.clean_customers            -- One row per customer
silver.clean_products             -- One row per product
-- SILVER: cleaned once, reused everywhere
CREATE TABLE silver.clean_orders AS
SELECT
  order_id,
  customer_id,
  order_date,
  CAST(total_amount AS DECIMAL(12,2)) AS total_amount,
  order_status,
  updated_at
FROM bronze.raw_orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY updated_at DESC
) = 1;

-- GOLD 1: Finance needs daily revenue
CREATE TABLE gold.daily_revenue AS
SELECT
  order_date,
  SUM(total_amount) AS total_revenue
FROM silver.clean_orders
WHERE order_status = 'completed'
GROUP BY order_date;

-- GOLD 2: Marketing needs customer LTV
CREATE TABLE gold.customer_ltv AS
SELECT
  c.customer_id,
  SUM(o.total_amount) AS lifetime_value
FROM silver.clean_customers c
LEFT JOIN silver.clean_orders o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
πŸ’‘ Interview Insight

"Dimensional modeling lives in the Gold layer. Silver tables are entity-centric β€” cleaned and conformed but not shaped for any specific business process. Model at Silver what's universally true about an entity; model at Gold what's specifically useful for a business question."

πŸ“‚ Partitioning Strategy

In a lakehouse, your partition strategy is inseparable from your data model. The wrong partition key turns a sub-second query into a full-scan nightmare; the right one lets partition pruning skip the vast majority of the data.

-- How partitioning works: file-system-level directories
delta_table/
β”œβ”€β”€ order_date=2024-01-01/
β”‚   β”œβ”€β”€ part-00000.parquet   (50 MB)
β”‚   └── part-00001.parquet   (48 MB)
β”œβ”€β”€ order_date=2024-01-02/
β”‚   β”œβ”€β”€ part-00000.parquet   (52 MB)
β”‚   └── part-00001.parquet   (47 MB)

-- Query: WHERE order_date = '2024-01-02'
-- Engine reads ONLY the 2024-01-02 directory (~99 MB)
-- Skips all other directories entirely

πŸ“… By Time (Most Common)

Aligns with how queries filter (by date range) and how data arrives (daily batches, hourly streams).

🏒 By Business Entity

Multi-tenant SaaS or regional data with strict isolation needs. Partition by tenant_id or region.

⚠️ Over-Partitioning

Partitioning by (year, month, day, hour, user_id) creates millions of tiny 1 KB files β€” query planning takes longer than query execution!

πŸ“ Right-Sizing Rule

Each partition should hold at least 128 MB (ideally 256 MB – 1 GB). Too small? Coarsen the key.
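One way to sanity-check file sizes on a Delta table is `DESCRIBE DETAIL` (Delta Lake syntax; the table name is illustrative):

```sql
-- Returns table-level stats including numFiles and sizeInBytes
DESCRIBE DETAIL silver.clean_orders;
-- If sizeInBytes / numFiles is far below 128 MB, the partition key
-- is too fine: coarsen it (e.g. day β†’ month) or compact with OPTIMIZE
```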

-- Delta Lake: partition by date
CREATE TABLE silver.clean_orders
USING DELTA
PARTITIONED BY (order_date)
AS SELECT * FROM transformed_orders;

-- Iceberg: hidden partitioning by month (game-changer!)
CREATE TABLE silver.clean_orders (
  order_id BIGINT,
  order_ts TIMESTAMP,
  amount   DECIMAL(12,2)
)
USING ICEBERG
PARTITIONED BY (months(order_ts));

-- Users query naturally β€” Iceberg prunes automatically
SELECT * FROM silver.clean_orders
WHERE order_ts BETWEEN '2024-01-15' AND '2024-01-20';

-- Partition evolution WITHOUT rewriting data!
ALTER TABLE silver.clean_orders
REPLACE PARTITION FIELD days(order_ts) WITH months(order_ts);
Factor                       | Recommendation                   | Example
Queries filter by date       | Partition by date/month          | PARTITIONED BY (order_date)
Multi-tenant isolation       | Partition by tenant_id           | PARTITIONED BY (tenant_id)
Table < 1 GB total           | Don't partition at all           | Small dimension tables
High-cardinality (user_id)   | Don't partition β€” Z-ORDER        | ZORDER BY (user_id)
Streaming (many small files) | Partition by date + auto-compact | Schedule OPTIMIZE regularly
πŸ“– Analogy

Partitioning is like organizing a library. Partition by genre and you skip entire aisles. But partition by author's last name AND first name AND year AND edition? Millions of shelves, each holding a single pamphlet β€” the librarian spends more time walking than reading. Z-ordering sorts books within a shelf by multiple criteria.

πŸ’Ύ Integration with Storage Formats

Your data model lives inside Delta Lake, Iceberg, or Hudi. These table formats dictate what operations are efficient, which schema changes are safe, and how physical layout affects query performance.

πŸ”„ MERGE for SCD2

Delta/Iceberg make MERGE efficient β€” the foundation for SCD2 in the lakehouse. Z-ORDER by the merge key for minimal file reads.

πŸ“Š Z-ORDER Alignment

Z-ORDER columns should match your model's most common join and filter keys. Fact tables β†’ customer_id, product_id.

πŸ“ˆ Column Statistics

Put filtered columns first in table definitions. Formats collect stats for the first N columns β€” enables file pruning.
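A hedged sketch of this idea in Delta Lake (the table is illustrative; `delta.dataSkippingNumIndexedCols` is the table property that controls how many leading columns get statistics):

```sql
-- Filtered columns first, so they fall inside the stats window
CREATE TABLE silver.clean_events (
  event_date DATE,      -- filtered in nearly every query
  user_id    BIGINT,    -- common filter/join key
  event_type STRING,
  payload    STRING     -- wide, rarely filtered: keep it last
) USING DELTA
TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8');
```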

πŸ”€ Schema Evolution

Iceberg tracks columns by ID, so renames are safe. Delta tracks columns by name by default, so renames are risky unless column mapping is enabled. This matters as Silver schemas evolve over time.

-- SCD Type 2 via MERGE in Delta Lake
MERGE INTO silver.dim_customer AS target
USING (
  SELECT
    customer_id,
    customer_name,
    city,
    current_timestamp() AS effective_from
  FROM bronze.raw_customers_batch
) AS source
ON target.customer_id = source.customer_id
   AND target.is_current = true

-- Expire changed records
WHEN MATCHED AND (
  target.customer_name != source.customer_name
  OR target.city != source.city
) THEN UPDATE SET
  target.is_current = false,
  target.effective_to = source.effective_from

-- Insert new customers
WHEN NOT MATCHED THEN INSERT (
  customer_id, customer_name, city,
  effective_from, effective_to, is_current
) VALUES (
  source.customer_id, source.customer_name, source.city,
  source.effective_from, NULL, true
);
-- Z-ORDER + compaction: align with model grain
OPTIMIZE silver.clean_orders
ZORDER BY (customer_id, order_date);

-- Schema evolution: Iceberg is safer
ALTER TABLE silver.clean_customers
RENAME COLUMN customer_email TO email_address;
-- Iceberg: existing Parquet files NOT rewritten (maps by column ID)
-- Delta: risky without column mapping enabled (tracks by column name by default)
Model Pattern             | Z-ORDER Columns         | Why
Fact table (orders)       | customer_id, product_id | Common join keys for star queries
Event table (clickstream) | user_id, session_id     | Filter/group by user then session
SCD2 dimension            | business_key            | MERGE finds all versions quickly
Aggregated Gold table     | No Z-ORDER needed       | Already small and pre-filtered
OBT (One Big Table)       | Top 2-3 filter columns  | Most impactful for wide scans
πŸ’‘ Interview Insight

"The table format dictates what's cheap. Delta and Iceberg make MERGE efficient β€” that's why SCD2 works well in the lakehouse. I align Z-ORDER with primary join/filter keys, schedule compaction after MERGE jobs targeting 128-256 MB files, and prefer Iceberg's hidden partitioning to decouple physical layout from queries. The partition key, Z-ORDER columns, and model grain must tell a consistent story."

πŸ”₯ Real-World Problems & Solutions

Every lakehouse hits the same failure modes. These four problems distinguish engineers who've operated production systems from those who've only read the docs.

🏚️ Problem 1: Silver Becomes a Swamp

Duplicate rows, no clear grain, multiple teams building overlapping Silver tables. Fix: grain contracts, PK constraints, post-load verification, single ownership.

🐌 Problem 2: Gold Tables Too Slow

Gold is just VIEWs on Silver β€” recomputed every query. Fix: materialize Gold tables, partition-align with dashboards, incremental MERGE.

πŸ’₯ Problem 3: Schema Drift Breaks Pipes

Source adds columns or changes types. Fix: schema drift detection before writes, auto-evolve non-breaking changes, quarantine breaking ones.
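A quarantine step for a breaking type change might look like this (tables are illustrative; `TRY_CAST` assumes Spark SQL 3.x):

```sql
-- Route rows that violate the Silver type contract to quarantine
INSERT INTO silver.quarantine_customers
SELECT raw,
       current_timestamp() AS _quarantined_at,
       'customer_id not castable to BIGINT' AS _reason
FROM bronze.raw_customers
WHERE raw:customer_id IS NOT NULL
  AND TRY_CAST(raw:customer_id AS BIGINT) IS NULL;

-- Only conforming rows flow on to Silver
INSERT INTO silver.clean_customers
SELECT TRY_CAST(raw:customer_id AS BIGINT) AS customer_id,
       raw:name::STRING AS customer_name
FROM bronze.raw_customers
WHERE TRY_CAST(raw:customer_id AS BIGINT) IS NOT NULL;
```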

πŸ“¬ Problem 4: Late-Arriving Data

January 5th order arrives January 20th. Gold already computed. Fix: reprocessing windows β€” MERGE last N days at Gold on every run.

-- Enforce grain: post-load verification (MUST return 0 rows)
SELECT order_id, COUNT(*) AS row_count
FROM silver.clean_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Materialize Gold instead of VIEWs
CREATE TABLE gold.daily_sales_summary
USING DELTA
PARTITIONED BY (order_date)
AS
SELECT
  o.order_date,
  s.store_name,
  SUM(o.total_amount) AS total_revenue,
  COUNT(DISTINCT o.order_id) AS order_count
FROM silver.clean_orders o
JOIN silver.clean_stores s
  ON o.store_id = s.store_id
WHERE o.order_status = 'completed'
GROUP BY o.order_date, s.store_name;
-- Late-arriving data: reprocess last 7 days every run
MERGE INTO gold.daily_sales_summary AS target
USING (
  SELECT
    o.order_date,
    s.store_name,
    SUM(o.total_amount) AS total_revenue,
    COUNT(DISTINCT o.order_id) AS order_count
  FROM silver.clean_orders o
  JOIN silver.clean_stores s
    ON o.store_id = s.store_id
  WHERE o.order_date >= current_date() - INTERVAL 7 DAYS
    AND o.order_status = 'completed'
  GROUP BY o.order_date, s.store_name
) AS source
ON target.order_date = source.order_date
   AND target.store_name = source.store_name
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
πŸ“– Analogy

Late-arriving data is like a postcard from vacation that arrives three weeks after you're home. Your photo album (Gold) already has a "January 5th" page. You don't throw away the album β€” you open to that page and add the postcard. The reprocessing window is checking your mailbox for late postcards every day for a week.

Quiz: Lakehouse Modeling Mastery

Q1: In Medallion Architecture, which layer should enforce schema and data types?

Q2: Silver table clean_orders has duplicates β€” same order_id appears 3 times. Most likely cause?

Q3: 200 GB Silver table, queries filter by last 30 days, frequent joins on customer_id. Optimal layout?

Q4: Source changes customer_id from INT to STRING. Delta has schema enforcement enabled. What happens?

Q5: Gold table daily_sales_summary is a VIEW. Dashboard queries take 45 seconds. Primary fix?

Q6: 50M customer SCD2 table, ~100K changes daily. Most efficient approach?

Q7: Iceberg table partitioned by days(order_ts). Query uses WHERE order_ts BETWEEN '2024-01-15 08:00' AND '2024-01-15 17:00'. Does partition pruning occur?


πŸ† Course Complete β€” Key Takeaways

You've mastered all 9 modules. Here's your lakehouse modeling cheat sheet β€” and a summary of the entire course.

πŸ… Medallion Architecture

Bronze = completeness (raw, append-only). Silver = correctness (cleaned, deduplicated). Gold = performance (star schemas, OBTs). Each layer has a clear contract.

πŸ“‹ Schema Strategy

Schema-on-read at Bronze, schema-on-write at Silver. Detect drift before it breaks pipes. Auto-evolve non-breaking changes; quarantine breaking ones.

🎯 Silver vs Gold

Silver = entity-centric (universally true). Gold = business-process-centric (specifically useful). Anti-pattern: star schemas at Silver = premature coupling.

πŸ“‚ Partitioning

Low-cardinality WHERE columns. 128 MB–1 GB per partition. Z-ORDER for high-cardinality joins. Iceberg hidden partitioning decouples layout from queries.

πŸ’Ύ Storage Formats

MERGE = lakehouse SCD2. Align Z-ORDER with merge keys. Iceberg tracks columns by ID (safe renames). Compact to 128-256 MB files after writes.

πŸ”₯ Production Survival

Grain contracts prevent swamps. Materialize Gold (no VIEWs). Reprocessing windows for late data. Schema drift detection saves 3 AM pages.

πŸŽ“ The Big Picture

"The lakehouse doesn't eliminate modeling β€” it changes where and when you model. Bronze defers decisions, Silver makes them, and Gold serves them. Get the boundaries right and everything downstream flows."

πŸ“˜ Modules 1-3

Foundations: what modeling is, facts/dimensions/grain, star vs snowflake schemas.

πŸ“— Modules 4-6

Core skills: SCDs, advanced patterns (bridge/junk/role-playing), interview gauntlet.

πŸ“• Modules 7-9

Modern mastery: Data Vault, activity schema & OBTs, and lakehouse modeling.