📊 When to Materialize

Materialize when many downstream queries reuse expensive computations or when interactive BI requires low latency. The rule: materialize stable, high-value marts; keep exploratory logic as views.

Raw Data → Complex Join + Agg → Materialized Table → 10 BI dashboards

Without materialization, each of those 10 dashboards re-runs the expensive join + aggregation on every refresh — roughly 10x the compute cost.
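The break-even math is simple enough to sketch. The numbers below are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope cost comparison (all numbers are illustrative).
JOIN_COST = 100.0   # compute units to run the expensive join + agg once
READ_COST = 1.0     # compute units to read the pre-computed table once
DASHBOARDS = 10     # downstream consumers per refresh cycle

# Without materialization: every dashboard re-runs the join.
cost_views = DASHBOARDS * JOIN_COST

# With materialization: pay the join once, then cheap reads.
cost_materialized = JOIN_COST + DASHBOARDS * READ_COST

print(cost_views, cost_materialized)  # 1000.0 110.0
```

At these (made-up) ratios, materialization cuts compute roughly 9x — and the gap widens as more dashboards are added, since the join cost is paid once regardless of read count.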

💡 Interview Gold

"I materialize when the ratio of reads to writes is high, the computation is expensive, and the data is stable enough to tolerate refresh lag. Otherwise I use views."

Views vs Materialized Views vs Tables

📝 View

  • Just a saved SQL query
  • Zero storage cost
  • Recomputes every access
  • Always fresh
  • Slow if logic is complex

🏗️ Materialized View

  • Pre-computed + stored
  • Storage cost exists
  • Fast reads, periodic refresh
  • May be stale between refreshes
  • Refresh has compute cost

📦 Table (Mart)

  • Full control over schema/partitioning
  • Storage + refresh compute cost
  • Best for high-traffic marts
  • Requires pipeline management
  • Most flexible optimization

The Cost of Over-Materialization

More tables isn't always better. Every materialized table adds storage cost + refresh compute + pipeline maintenance.

  • Storage: 3 copies of overlapping data
  • Refresh compute
  • Pipeline operations
  • Failure modes to monitor

🚫 Symptoms of Over-Materialization

Tables no one queries. Multiple tables with 90% column overlap. Refresh jobs that cost more than the queries they serve.

✅ Healthy Materialization

Each table has ≥ 3 regular consumers. Refresh cost < 20% of ad-hoc re-computation cost. Clear ownership and SLAs.

⚠️ Audit Regularly

Run a quarterly audit: for each materialized table, count downstream consumers and measure refresh cost vs query cost. Deprecate unused tables aggressively.
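The audit rules above translate directly into code. A minimal sketch — the field names (`consumers`, `refresh_cost`, `adhoc_cost`) and the sample tables are assumptions for illustration:

```python
def audit(tables):
    """Classify materialized tables using the health rules above.

    Each table is a dict with 'name', 'consumers', 'refresh_cost',
    and 'adhoc_cost' (cost of re-computing on demand instead).
    """
    decisions = {}
    for t in tables:
        if t["consumers"] == 0:
            decisions[t["name"]] = "deprecate"   # no one queries it
        elif t["consumers"] >= 3 and t["refresh_cost"] < 0.2 * t["adhoc_cost"]:
            decisions[t["name"]] = "keep"        # healthy: enough readers, cheap refresh
        else:
            decisions[t["name"]] = "review"      # borderline: check overlap and ownership
    return decisions

tables = [
    {"name": "sales_mart",     "consumers": 12, "refresh_cost": 10, "adhoc_cost": 200},
    {"name": "legacy_summary", "consumers": 0,  "refresh_cost": 50, "adhoc_cost": 50},
    {"name": "ops_rollup",     "consumers": 2,  "refresh_cost": 40, "adhoc_cost": 60},
]
print(audit(tables))
# {'sales_mart': 'keep', 'legacy_summary': 'deprecate', 'ops_rollup': 'review'}
```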

Caching Strategies

Caching sits between "compute every time" (view) and "persist forever" (table). Use it for intermediate results within a pipeline session.

💾 Spark .persist()

Caches a DataFrame in memory (or disk) for reuse within the same Spark job. Freed when the session ends.

🔄 Warehouse Result Cache

BigQuery, Snowflake, Redshift cache query results. Repeated identical queries return instantly at zero compute cost.

📊 Delta Cache

Databricks caches Parquet data on local SSDs. Subsequent reads of the same files are 10x faster. Automatic and transparent.

```python
# Spark: cache a DataFrame used multiple times
from pyspark.sql.functions import col
from pyspark import StorageLevel

filtered = large_df.filter(col("status") == "active")
filtered.persist(StorageLevel.MEMORY_AND_DISK)

# Use it in multiple downstream operations
count_by_region = filtered.groupBy("region").count()
count_by_product = filtered.groupBy("product").count()

# IMPORTANT: unpersist when done to free memory
filtered.unpersist()
```

Incremental vs Full Refresh

Full refresh reprocesses everything. Incremental only processes new or changed data. The cost difference can be 10-100x.

🔄 Full Refresh

  • Reprocesses all historical data
  • Simple to implement
  • Correct by construction
  • Cost grows with table size
  • OK for small tables (< 10 GB)

📈 Incremental

  • Only processes new/changed data
  • Complex: needs watermarks, CDC
  • Must handle late-arriving data
  • Cost is ~constant regardless of table size
  • Essential for large tables (100+ GB)
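The watermark-plus-late-data logic above can be sketched without any warehouse at all. This is a simplified model, assuming a daily-partitioned table and a fixed lookback window (a common lightweight alternative to full CDC; the 2-day lookback and field names are illustrative):

```python
from datetime import date, timedelta

def incremental_window(watermark: date, today: date, lookback_days: int = 2):
    """Compute the [start, end) window for an incremental run.

    Re-reading `lookback_days` before the watermark picks up
    late-arriving rows without reprocessing the whole table.
    """
    start = watermark - timedelta(days=lookback_days)
    return start, today

def run_incremental(rows, watermark: date, today: date):
    start, end = incremental_window(watermark, today)
    # Only rows inside the window are reprocessed; cost stays
    # ~constant no matter how large the full table grows.
    return [r for r in rows if start <= r["event_date"] < end]

rows = [
    {"event_date": date(2024, 1, 1)},   # old: skipped
    {"event_date": date(2024, 1, 9)},   # late arrival inside lookback: reprocessed
    {"event_date": date(2024, 1, 10)},  # new since watermark: reprocessed
]
print(len(run_incremental(rows, watermark=date(2024, 1, 10), today=date(2024, 1, 11))))  # 2
```

The key design choice is the lookback: too short and late rows are silently dropped; too long and the run cost creeps back toward a full refresh.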

🎯 The Hybrid Approach

Use incremental for daily runs, but schedule a periodic full refresh (weekly/monthly) as a "correction pass" to fix any drift. This gives you low daily cost with guaranteed correctness over time.
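The scheduling side of the hybrid approach is a one-liner. A sketch, assuming a weekly full refresh on Sundays — monthly or any other cadence works the same way:

```python
from datetime import date

def refresh_mode(run_date: date, full_refresh_weekday: int = 6) -> str:
    """Hybrid schedule: incremental daily, full 'correction pass' weekly.

    `full_refresh_weekday` follows Python's convention (Monday=0 ... Sunday=6).
    """
    return "full" if run_date.weekday() == full_refresh_weekday else "incremental"

print(refresh_mode(date(2024, 1, 7)))   # a Sunday -> full (correction pass)
print(refresh_mode(date(2024, 1, 8)))   # a Monday -> incremental
```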

Quiz: Test Yourself

Q1: A complex 5-table join feeds 12 dashboards refreshed daily. Should you materialize it?

Q2: What's the main risk of over-materialization?

Q3: An analyst runs the same dashboard query 50 times a day. Best caching strategy?

Q4: Your table grows 5 GB/day and is now 2 TB. Full refresh takes 4 hours. What should you do?

Q5: What's the most common mistake with Spark .persist()?