One bad query can consume an entire shared cluster for hours, blocking every other team. Workload management is your defense.
Kill queries exceeding a time limit (e.g., 30 min). Catches accidental cross-joins and missing WHERE clauses.
Cap simultaneous queries per user/team. Prevents one team from flooding the cluster with 200 parallel queries.
Allocate CPU/memory per team. Team A gets 40%, Team B gets 30%, shared pool gets 30%.
Reject queries that would scan > N TB. In BigQuery: maximumBytesBilled. Prevents $10K surprise bills.
The goal: enough capacity for peak load, no waste during quiet periods. Auto-scaling bridges the gap, but it's not free — scaling events take time and have overhead.
"I size clusters by profiling the workload on a small sample, then scaling up. I use auto-scaling with min/max bounds, auto-terminate after 15 minutes idle, and spot instances for batch workers."
A backfill reprocesses historical data — days, months, or years. Without estimation, it can be the most expensive thing your team does all quarter.
Run on 1-day sample. Measure compute time, bytes scanned, shuffle. Multiply by backfill window.
Don't run 365 days at once. Process in weekly batches with checkpoints. If something breaks, you don't restart from zero.
Run backfills at night/weekends when clusters have spare capacity. Use spot instances for 60-90% savings.
Always have a rollback strategy. Write to a staging table first, validate, then swap. Never overwrite production in-place.
You can't optimize what you don't measure. These are the metrics interviewers expect you to know.
Embed cost awareness into the engineering workflow — not as an afterthought, but as part of the development process.
Set alerts at 50%, 80%, 100% of monthly budget. Slack/email notifications. Action owners for each threshold.
CI checks that flag SELECT *, missing partition filters, and cross-joins before they reach production.
Tag every resource with team, project, environment. Without tags, cost attribution is impossible.
A 10% WoW increase is a signal. Investigate before it compounds into a 50% monthly increase.
"Does this pipeline change increase bytes scanned? Does it add a new materialized table? What's the estimated monthly cost? Will it use spot or on-demand?" — Questions every PR reviewer should ask.