Materialize when many downstream queries reuse expensive computations or when interactive BI requires low latency. The rule: materialize stable, high-value marts; keep exploratory logic as views.
Without materialization, those 10 dashboards each re-run the expensive join + aggregation — 10x the compute cost.
"I materialize when the ratio of reads to writes is high, the computation is expensive, and the data is stable enough to tolerate refresh lag. Otherwise I use views."
More tables isn't always better. Every materialized table adds storage cost + refresh compute + pipeline maintenance.
Warning signs of over-materialization: tables no one queries; multiple tables with 90% column overlap; refresh jobs that cost more than the queries they serve.
Health criteria for keeping a table: each has ≥ 3 regular consumers; refresh cost < 20% of the ad-hoc re-computation it replaces; clear ownership and SLAs.
Run a quarterly audit: for each materialized table, count downstream consumers and measure refresh cost vs query cost. Deprecate unused tables aggressively.
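The quarterly audit can be sketched as a pass over per-table metrics against the health criteria above (the metric field names are hypothetical; in practice they'd come from the warehouse's query logs):

```python
def audit(tables: list[dict]) -> list[str]:
    """Return names of materialized tables failing the health criteria:
    fewer than 3 regular consumers, or refresh cost at or above 20% of
    the ad-hoc re-computation cost the table avoids."""
    deprecate = []
    for t in tables:
        healthy = (t["consumers"] >= 3
                   and t["refresh_cost"] < 0.2 * t["recompute_cost"])
        if not healthy:
            deprecate.append(t["name"])
    return deprecate

audit([
    {"name": "daily_revenue", "consumers": 8, "refresh_cost": 2.0, "recompute_cost": 40.0},
    {"name": "legacy_export", "consumers": 1, "refresh_cost": 9.0, "recompute_cost": 10.0},
])  # → ["legacy_export"]
```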
Caching sits between "compute every time" (view) and "persist forever" (table). Use it for intermediate results within a pipeline session.
Spark's `df.cache()` keeps a DataFrame in memory (spilling to disk if needed) for reuse within the same Spark job; the cache is freed when the session ends.
BigQuery, Snowflake, and Redshift cache query results: a repeated, byte-identical query against unchanged underlying data returns instantly at no extra compute cost.
Databricks caches Parquet data on worker-local SSDs, so subsequent reads of the same files can be ~10x faster. Automatic and transparent.
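The result-cache layer's core behavior is worth internalizing: the cache key is the exact query text plus the state of the underlying data, so any write invalidates prior results. A toy model in plain Python (the class and its versioning scheme are illustrative, not any warehouse's actual implementation):

```python
class ResultCache:
    """Toy model of a warehouse result cache: results are keyed by the
    exact query text plus a data version; any write bumps the version,
    so stale results are never served."""
    def __init__(self):
        self.version = 0
        self.cache = {}
        self.compute_calls = 0   # tracks how often we actually pay for compute

    def query(self, sql: str, run):
        key = (sql, self.version)
        if key not in self.cache:
            self.compute_calls += 1   # cache miss: run the real query
            self.cache[key] = run()
        return self.cache[key]        # cache hit: zero compute

    def write(self):
        self.version += 1  # new data arrived: all cached results are now stale
```

Repeating the same query costs nothing until a write lands; then the next identical query recomputes once and is cached again.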
Full refresh reprocesses everything. Incremental only processes new or changed data. The cost difference can be 10-100x.
Use incremental for daily runs, but schedule a periodic full refresh (weekly/monthly) as a "correction pass" to fix any drift. This gives you low daily cost with guaranteed correctness over time.
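The daily-incremental-plus-periodic-full pattern can be sketched with a watermark over an `updated_at` column. A minimal sketch assuming rows keyed as `key -> (value, updated_at)` (the function and schema are hypothetical); note the deliberate flaw it demonstrates: incremental runs never see deletes, which is exactly the drift the full refresh corrects:

```python
def refresh(target: dict, source: dict, watermark: int, full: bool) -> int:
    """One refresh pass. Incremental upserts only rows changed since the
    watermark -- cheap, but blind to deletes and late-arriving updates.
    A full refresh rebuilds the target from source, correcting drift.
    Returns the new watermark."""
    if full:
        target.clear()
        target.update(source)
    else:
        for key, row in source.items():
            if row[1] > watermark:   # row[1] is updated_at
                target[key] = row
    return max((row[1] for row in source.values()), default=watermark)
```

Daily runs call `refresh(..., full=False)`; the weekly "correction pass" calls it with `full=True`, removing rows that were deleted upstream but survived every incremental run.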