🛡️ Preventing Runaway Queries

One bad query can consume an entire shared cluster for hours, blocking every other team. Workload management is your defense.

⏱️ Query Timeouts

Kill queries exceeding a time limit (e.g., 30 min). Catches accidental cross-joins and missing WHERE clauses.

🔢 Concurrency Limits

Cap simultaneous queries per user/team. Prevents one team from flooding the cluster with 200 parallel queries.
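
A minimal sketch of a per-team cap, assuming queries pass through a gateway you control. `TeamConcurrencyLimiter` and its methods are hypothetical names for illustration; managed warehouses usually expose this as configuration rather than code.

```python
import threading
from collections import defaultdict


class TeamConcurrencyLimiter:
    """Caps in-flight queries per team; excess queries are rejected, not queued."""

    def __init__(self, max_per_team: int):
        self._max = max_per_team
        self._running = defaultdict(int)  # team -> queries currently running
        self._lock = threading.Lock()

    def try_acquire(self, team: str) -> bool:
        """Return True and count the query if the team is under its cap."""
        with self._lock:
            if self._running[team] >= self._max:
                return False
            self._running[team] += 1
            return True

    def release(self, team: str) -> None:
        """Call when a query finishes, whether it succeeded or failed."""
        with self._lock:
            if self._running[team] > 0:
                self._running[team] -= 1


limiter = TeamConcurrencyLimiter(max_per_team=2)
results = [limiter.try_acquire("team_a") for _ in range(3)]
```

Here the third query from `team_a` is refused while other teams are unaffected. Queuing instead of rejecting is an equally valid design choice.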

📊 Resource Quotas

Allocate CPU/memory per team. Team A gets 40%, Team B gets 30%, shared pool gets 30%.
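
As a rough illustration of the split above, the sketch below turns percentage shares into vCPU counts. The 200-vCPU cluster size and the `allocate_vcpus` helper are assumptions made for the example.

```python
def allocate_vcpus(total_vcpus: int, shares_pct: dict) -> dict:
    """Split a cluster's vCPUs by integer percentage shares that sum to 100."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100%")
    # Integer arithmetic avoids floating-point rounding surprises.
    return {name: total_vcpus * pct // 100 for name, pct in shares_pct.items()}


quota = allocate_vcpus(200, {"team_a": 40, "team_b": 30, "shared": 30})
```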

🔒 Bytes-Scanned Limits

Reject queries that would scan > N TB. In BigQuery: maximumBytesBilled. Prevents $10K surprise bills.

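-- BigQuery: limit bytes billed per query.
-- maximum_bytes_billed is a job setting, not a SQL clause; set it when submitting, e.g.
--   bq query --use_legacy_sql=false --maximum_bytes_billed=1099511627776 '<query>'
-- The job fails without charge if it would bill more than the cap (1 TiB here).
SELECT * FROM large_table WHERE date = '2024-03-15';

-- Snowflake: statement timeout
ALTER WAREHOUSE analytics_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 1800; -- 30 min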

Auto-Scaling & Cluster Sizing

The goal: enough capacity for peak load, no waste during quiet periods. Auto-scaling bridges the gap, but it's not free — scaling events take time and have overhead.

✅ Right-Sizing Checklist

  • Monitor CPU/memory utilization weekly
  • Target 65-80% utilization at peak
  • Auto-terminate idle clusters (15 min)
  • Use spot instances for worker nodes
  • Separate ETL and interactive workloads
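
The utilization target in the checklist can be turned into a simple check. This is a sketch under the assumption that you already collect utilization samples as fractions between 0 and 1; `sizing_verdict` is a hypothetical helper.

```python
def sizing_verdict(utilization_samples, low=0.65, high=0.80):
    """Classify a cluster by its peak utilization against a 65-80% target band."""
    peak = max(utilization_samples)
    if peak < low:
        return "over-provisioned"   # paying for capacity that is never used
    if peak > high:
        return "under-provisioned"  # little headroom left at peak
    return "right-sized"


always_on = sizing_verdict([0.05, 0.10, 0.08])
healthy = sizing_verdict([0.40, 0.72, 0.55])
```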

🚫 Common Mistakes

  • Always-on clusters running at 10% utilization
  • One big cluster for all workloads
  • No auto-terminate — idle overnight/weekends
  • Over-provisioned "just in case"
  • Not using spot for fault-tolerant jobs

💡 Interview Gold

"I size clusters by profiling the workload on a small sample, then scaling up. I use auto-scaling with min/max bounds, auto-terminate after 15 minutes idle, and spot instances for batch workers."

Planning Backfill Costs

A backfill reprocesses historical data — days, months, or years. Without estimation, it can be the most expensive thing your team does all quarter.

Sample 1 day → Measure: $15 / day → Extrapolate: 365 days → Total: ~$5,475

📐 Estimate First

Run on 1-day sample. Measure compute time, bytes scanned, shuffle. Multiply by backfill window.
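
The extrapolation is plain multiplication. A minimal sketch using the $15/day sample figure from this section; the `safety_margin` parameter is an added assumption to cover days that are heavier than the sample.

```python
def estimate_backfill_cost(sample_cost_per_day, days, safety_margin=0.0):
    """Linear extrapolation: one-day sample cost times the backfill window."""
    return sample_cost_per_day * days * (1 + safety_margin)


baseline = estimate_backfill_cost(15, 365)                   # $15/day over 365 days
padded = estimate_backfill_cost(15, 365, safety_margin=0.2)  # with a 20% buffer
```

A linear estimate assumes every day looks like the sample. If data volume has grown over the backfill window, sampling one early day and one recent day gives a better range.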

📅 Stage the Backfill

Don't run 365 days at once. Process in weekly batches with checkpoints. If something breaks, you don't restart from zero.
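
One way to sketch weekly batching with a resumable checkpoint. The function names are hypothetical, and a real pipeline would persist the checkpoint somewhere durable (a table or object store) instead of returning it.

```python
from datetime import date, timedelta


def weekly_batches(start, end, batch_days=7):
    """Yield inclusive (batch_start, batch_end) windows covering start..end."""
    cursor = start
    while cursor <= end:
        batch_end = min(cursor + timedelta(days=batch_days - 1), end)
        yield cursor, batch_end
        cursor = batch_end + timedelta(days=1)


def run_backfill(start, end, process, checkpoint=None):
    """Process batches in order, skipping any that end on or before the checkpoint.

    Returns the end date of the last completed batch.
    """
    for batch_start, batch_end in weekly_batches(start, end):
        if checkpoint is not None and batch_end <= checkpoint:
            continue  # already done in a previous run
        process(batch_start, batch_end)
        checkpoint = batch_end  # persist this durably in a real pipeline
    return checkpoint


batches = list(weekly_batches(date(2023, 1, 1), date(2023, 12, 31)))
```

If a run fails partway, calling `run_backfill` again with the last saved checkpoint resumes from the first unfinished week.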

🌙 Off-Peak Windows

Run backfills at night or on weekends when clusters have spare capacity. Spot instances can cut compute cost by roughly 60-90% versus on-demand, provided the job tolerates interruptions.

🔙 Rollback Plan

Always have a rollback strategy. Write to a staging table first, validate, then swap. Never overwrite production in-place.
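
The stage, validate, swap pattern can be sketched with SQLite from the Python standard library. The table names (`sales`, `sales_staging`, `sales_previous`) and the validation rules are assumptions for illustration; a warehouse would use its own swap or rename mechanism.

```python
import sqlite3


def publish_via_staging(conn, rows, min_rows=1):
    """Load rows into a staging table, validate, then swap it into place.

    Returns True if the swap happened, False if validation rejected the data.
    The previous production table is kept as sales_previous for rollback.
    """
    conn.execute("DROP TABLE IF EXISTS sales_staging")
    conn.execute("CREATE TABLE sales_staging (day TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)

    # Validation: enough rows, and no negative revenue.
    total, bad = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(revenue < 0), 0) FROM sales_staging"
    ).fetchone()
    if total < min_rows or bad > 0:
        return False

    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE IF EXISTS sales_previous")
        conn.execute("ALTER TABLE sales RENAME TO sales_previous")
        conn.execute("ALTER TABLE sales_staging RENAME TO sales")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise
    return True


# isolation_level=None puts the connection in autocommit mode,
# so the explicit BEGIN/COMMIT above control the swap transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sales (day TEXT, revenue REAL)")
conn.execute("INSERT INTO sales VALUES ('2024-03-14', 100.0)")

ok = publish_via_staging(conn, [("2024-03-14", 100.0), ("2024-03-15", 120.0)])
rejected = publish_via_staging(conn, [("2024-03-16", -5.0)])
```

Because production is only touched by the rename, a bad batch never reaches readers, and rolling back means renaming `sales_previous` back into place.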

FinOps Metrics That Matter

You can't optimize what you don't measure. These are the metrics interviewers expect you to know.

  • $/pipeline: cost per pipeline run
  • TB scanned: bytes scanned per query
  • % utilized: cluster utilization rate
  • $/team: cost by team / project
  • GB growth: storage growth by layer
  • WoW Δ: week-over-week cost change
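
Two of these metrics are easy to compute from billing records. A sketch assuming each record is a dict with hypothetical `team` and `cost_usd` fields; real billing exports use their own schemas.

```python
from collections import defaultdict


def cost_by_team(records):
    """Sum cost per team from a list of billing records."""
    totals = defaultdict(float)
    for record in records:
        totals[record["team"]] += record["cost_usd"]
    return dict(totals)


def wow_change(this_week, last_week):
    """Week-over-week change as a fraction (0.10 means +10%)."""
    return (this_week - last_week) / last_week


records = [
    {"team": "analytics", "cost_usd": 120.0},
    {"team": "ml", "cost_usd": 80.0},
    {"team": "analytics", "cost_usd": 50.0},
]
per_team = cost_by_team(records)
delta = wow_change(this_week=1100, last_week=1000)
```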

💰 Chargeback

  • Teams pay for their actual usage
  • Creates direct cost accountability
  • Requires granular cost attribution
  • Can cause friction between teams
  • Best for mature organizations

👁️ Showback

  • Teams see costs but don't pay directly
  • Awareness without enforcement
  • Lower friction, easier to implement
  • Less effective at changing behavior
  • Good starting point for FinOps adoption

Cost Guardrails in Practice

Embed cost awareness into the engineering workflow — not as an afterthought, but as part of the development process.

🔔 Budget Alerts

Set alerts at 50%, 80%, and 100% of the monthly budget. Route notifications to Slack/email and assign an action owner to each threshold.
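
A minimal sketch of the threshold logic, assuming spend and budget are already available as numbers. `crossed_thresholds` is a hypothetical helper; cloud providers offer built-in budget alerts that cover the same ground.

```python
THRESHOLDS = (0.5, 0.8, 1.0)  # 50%, 80%, 100% of monthly budget


def crossed_thresholds(spend, budget, already_alerted=()):
    """Return thresholds reached by current spend that have not fired yet."""
    ratio = spend / budget
    return [t for t in THRESHOLDS if ratio >= t and t not in already_alerted]


first_check = crossed_thresholds(850, 1000)
second_check = crossed_thresholds(850, 1000, already_alerted={0.5})
```

Tracking which thresholds already fired keeps the channel from being flooded with repeat notifications for the same level.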

🔍 Query Linting

CI checks that flag SELECT *, missing partition filters, and cross-joins before they reach production.
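
A rough sketch of such a check using regular expressions. It is a heuristic and not a SQL parser, and the partition column name `date` is an assumption; a production linter would work on a parsed query.

```python
import re


def lint_query(sql, partition_column="date"):
    """Return a list of findings for common expensive-query patterns."""
    findings = []
    text = re.sub(r"--[^\n]*", "", sql)  # drop line comments before matching

    if re.search(r"\bselect\s+\*", text, re.IGNORECASE):
        findings.append("select-star")
    if re.search(r"\bcross\s+join\b", text, re.IGNORECASE):
        findings.append("cross-join")

    # Flag queries whose WHERE clause never mentions the partition column.
    where = re.search(r"\bwhere\b(.*)", text, re.IGNORECASE | re.DOTALL)
    pattern = rf"\b{re.escape(partition_column)}\b"
    if not where or not re.search(pattern, where.group(1), re.IGNORECASE):
        findings.append("missing-partition-filter")
    return findings


bad = lint_query("SELECT * FROM events")
good = lint_query("SELECT id FROM events WHERE date = '2024-03-15'")
```

In CI, a non-empty findings list would fail the check or post a review comment.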

📊 Cost Tags

Tag every resource with team, project, environment. Without tags, cost attribution is impossible.
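
Tag coverage can be enforced with a small check at deploy time. A sketch assuming resource tags arrive as a plain dict; `missing_tags` is a hypothetical helper.

```python
REQUIRED_TAGS = ("team", "project", "environment")


def missing_tags(resource_tags):
    """Return required tag keys that are absent or empty on a resource."""
    return [key for key in REQUIRED_TAGS if not resource_tags.get(key)]


gaps = missing_tags({"team": "analytics", "environment": ""})
```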

📈 Trend Monitoring

A 10% WoW increase is a signal. Investigate before it compounds: four consecutive weeks at +10% add up to roughly a 46% monthly increase.
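
The compounding arithmetic behind this kind of warning, as a short sketch. Treating a month as four weeks is a simplifying assumption.

```python
def compounded_growth(weekly_rate, weeks):
    """Total growth after compounding a weekly rate, as a fraction."""
    return (1 + weekly_rate) ** weeks - 1


monthly = compounded_growth(0.10, 4)  # four weeks of +10% each
```

Four weeks at 10% gives 1.1^4 - 1, or about 46%, which is why a modest weekly drift deserves attention early.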

🎯 The PR Checklist

"Does this pipeline change increase bytes scanned? Does it add a new materialized table? What's the estimated monthly cost? Will it use spot or on-demand?" — Questions every PR reviewer should ask.

Quiz: Test Yourself

Q1: An analyst accidentally runs a cross-join on a 2 TB table. What guardrail prevents this?

Q2: What's the target utilization for a well-sized cluster?

Q3: How do you estimate the cost of a 1-year backfill?

Q4: You want to implement chargeback. What's the first step?

Q5: Your org has no cost culture. What's the best starting model?