📊 When to Materialize

Materialize when many downstream queries reuse expensive computations or when interactive BI requires low latency. The rule: materialize stable, high-value marts; keep exploratory logic as views.

Raw Data → Complex Join + Agg → Materialized Table → 10 BI dashboards

Without materialization, each of those 10 dashboards re-runs the expensive join + aggregation on every refresh — roughly 10x the compute cost.
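The break-even math is simple enough to sketch. The numbers below are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope cost comparison (all numbers are illustrative).
JOIN_COST = 100.0   # compute units to run the expensive join + agg once
READ_COST = 1.0     # compute units to read the pre-computed table once
DASHBOARDS = 10     # downstream consumers per refresh cycle

# Without materialization: every dashboard re-runs the join.
cost_views = DASHBOARDS * JOIN_COST

# With materialization: pay the join once, then cheap reads.
cost_materialized = JOIN_COST + DASHBOARDS * READ_COST

print(cost_views, cost_materialized)  # 1000.0 110.0
```

At these (made-up) ratios, materialization cuts compute roughly 9x — and the gap widens as more dashboards are added, since the join cost is paid once regardless of read count.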

💡 Interview Gold

"I materialize when the ratio of reads to writes is high, the computation is expensive, and the data is stable enough to tolerate refresh lag. Otherwise I use views."

Views vs Materialized Views vs Tables

📝 View

  • Just a saved SQL query
  • Zero storage cost
  • Recomputes every access
  • Always fresh
  • Slow if logic is complex

🏗️ Materialized View

  • Pre-computed + stored
  • Storage cost exists
  • Fast reads, periodic refresh
  • May be stale between refreshes
  • Refresh has compute cost

📦 Table (Mart)

  • Full control over schema/partitioning
  • Storage + refresh compute cost
  • Best for high-traffic marts
  • Requires pipeline management
  • Most flexible optimization

The Cost of Over-Materialization

More tables isn't always better. Every materialized table adds storage cost + refresh compute + pipeline maintenance.

  • Storage: 3 copies of overlapping data
  • Refresh compute
  • Pipeline operations
  • Failure modes to monitor

🚫 Symptoms of Over-Materialization

Tables no one queries. Multiple tables with 90% column overlap. Refresh jobs that cost more than the queries they serve.

✅ Healthy Materialization

Each table has ≥ 3 regular consumers. Refresh cost < 20% of ad-hoc re-computation cost. Clear ownership and SLAs.

⚠️ Audit Regularly

Run a quarterly audit: for each materialized table, count downstream consumers and measure refresh cost vs query cost. Deprecate unused tables aggressively.
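The audit rules above translate directly into code. A minimal sketch — the field names (`consumers`, `refresh_cost`, `adhoc_cost`) and the sample tables are assumptions for illustration:

```python
def audit(tables):
    """Classify materialized tables using the health rules above.

    Each table is a dict with 'name', 'consumers', 'refresh_cost',
    and 'adhoc_cost' (cost of re-computing on demand instead).
    """
    decisions = {}
    for t in tables:
        if t["consumers"] == 0:
            decisions[t["name"]] = "deprecate"   # no one queries it
        elif t["consumers"] >= 3 and t["refresh_cost"] < 0.2 * t["adhoc_cost"]:
            decisions[t["name"]] = "keep"        # healthy: enough readers, cheap refresh
        else:
            decisions[t["name"]] = "review"      # borderline: check overlap and ownership
    return decisions

tables = [
    {"name": "sales_mart",     "consumers": 12, "refresh_cost": 10, "adhoc_cost": 200},
    {"name": "legacy_summary", "consumers": 0,  "refresh_cost": 50, "adhoc_cost": 50},
    {"name": "ops_rollup",     "consumers": 2,  "refresh_cost": 40, "adhoc_cost": 60},
]
print(audit(tables))
# {'sales_mart': 'keep', 'legacy_summary': 'deprecate', 'ops_rollup': 'review'}
```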

Caching Strategies

Caching sits between "compute every time" (view) and "persist forever" (table). Use it for intermediate results within a pipeline session.

💾 Spark .persist()

Caches a DataFrame in memory (or disk) for reuse within the same Spark job. Freed when the session ends.

🔄 Warehouse Result Cache

BigQuery, Snowflake, Redshift cache query results. Repeated identical queries return instantly at zero compute cost.

📊 Delta Cache

Databricks caches Parquet data on local SSDs. Subsequent reads of the same files are 10x faster. Automatic and transparent.

```python
# Spark: cache a DataFrame used multiple times
from pyspark.sql.functions import col
from pyspark import StorageLevel

filtered = large_df.filter(col("status") == "active")
filtered.persist(StorageLevel.MEMORY_AND_DISK)

# Use it in multiple downstream operations
count_by_region = filtered.groupBy("region").count()
count_by_product = filtered.groupBy("product").count()

# IMPORTANT: unpersist when done to free memory
filtered.unpersist()
```

Incremental vs Full Refresh

Full refresh reprocesses everything. Incremental only processes new or changed data. The cost difference can be 10-100x.

🔄 Full Refresh

  • Reprocesses all historical data
  • Simple to implement
  • Correct by construction
  • Cost grows with table size
  • OK for small tables (< 10 GB)

📈 Incremental

  • Only processes new/changed data
  • Complex: needs watermarks, CDC
  • Must handle late-arriving data
  • Cost is ~constant regardless of table size
  • Essential for large tables (100+ GB)
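The watermark-plus-late-data logic above can be sketched without any warehouse at all. This is a simplified model, assuming a daily-partitioned table and a fixed lookback window (a common lightweight alternative to full CDC; the 2-day lookback and field names are illustrative):

```python
from datetime import date, timedelta

def incremental_window(watermark: date, today: date, lookback_days: int = 2):
    """Compute the [start, end) window for an incremental run.

    Re-reading `lookback_days` before the watermark picks up
    late-arriving rows without reprocessing the whole table.
    """
    start = watermark - timedelta(days=lookback_days)
    return start, today

def run_incremental(rows, watermark: date, today: date):
    start, end = incremental_window(watermark, today)
    # Only rows inside the window are reprocessed; cost stays
    # ~constant no matter how large the full table grows.
    return [r for r in rows if start <= r["event_date"] < end]

rows = [
    {"event_date": date(2024, 1, 1)},   # old: skipped
    {"event_date": date(2024, 1, 9)},   # late arrival inside lookback: reprocessed
    {"event_date": date(2024, 1, 10)},  # new since watermark: reprocessed
]
print(len(run_incremental(rows, watermark=date(2024, 1, 10), today=date(2024, 1, 11))))  # 2
```

The key design choice is the lookback: too short and late rows are silently dropped; too long and the run cost creeps back toward a full refresh.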

🎯 The Hybrid Approach

Use incremental for daily runs, but schedule a periodic full refresh (weekly/monthly) as a "correction pass" to fix any drift. This gives you low daily cost with guaranteed correctness over time.
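The scheduling side of the hybrid approach is a one-liner. A sketch, assuming a weekly full refresh on Sundays — monthly or any other cadence works the same way:

```python
from datetime import date

def refresh_mode(run_date: date, full_refresh_weekday: int = 6) -> str:
    """Hybrid schedule: incremental daily, full 'correction pass' weekly.

    `full_refresh_weekday` follows Python's convention (Monday=0 ... Sunday=6).
    """
    return "full" if run_date.weekday() == full_refresh_weekday else "incremental"

print(refresh_mode(date(2024, 1, 7)))   # a Sunday -> full (correction pass)
print(refresh_mode(date(2024, 1, 8)))   # a Monday -> incremental
```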

Quiz: Test Yourself

Q1: A complex 5-table join feeds 12 dashboards refreshed daily. Should you materialize it?

Q2: What's the main risk of over-materialization?

Q3: An analyst runs the same dashboard query 50 times a day. Best caching strategy?

Q4: Your table grows 5 GB/day and is now 2 TB. Full refresh takes 4 hours. What should you do?

Q5: What's the most common mistake with Spark .persist()?