🔥 The Interview Gauntlet

This is the final test: tricky scenario questions, debugging exercises, and gotchas that trip up experienced engineers. Read each scenario, think, then reveal the answer.

🎯 How to Use This Module

For each scenario, pause and formulate your answer before clicking "Reveal." In a real interview, you need to think out loud — practice that here.

Scenario 1: The $10K/day Spark Job

"Your team's daily Spark ETL job costs $10,000/day. The CEO wants it halved. Where do you start?"

1. Profile the job — check the Spark UI for shuffle volume, skew, and scan sizes.
2. Check partition pruning — is the query scanning the full table?
3. Look at join strategies — can any be broadcast?
4. Check cluster utilization — is it over-provisioned?
5. Move workers to spot instances.
6. Consider incremental processing instead of a full refresh.

Typically, the biggest wins come from reducing bytes scanned and fixing data skew.
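As a toy illustration of the skew check in step 1, here is a pure-Python sketch (no Spark required, function name and threshold are illustrative) of flagging join keys that dominate a dataset — the same signal you would look for in the Spark UI's task-duration histogram:

```python
from collections import Counter

def skewed_keys(rows, key, threshold=0.5):
    """Return keys that account for more than `threshold` of all rows.

    A single key holding 50%+ of the rows means one shuffle partition does
    most of the work — the classic straggler pattern behind slow, costly jobs.
    """
    counts = Counter(r[key] for r in rows)
    total = sum(counts.values())
    return [k for k, c in counts.items() if c / total > threshold]

# One "hot" key holds 80% of the rows — a clear skew candidate
rows = [{"user": "hot"}] * 80 + [{"user": f"u{i}"} for i in range(20)]
print(skewed_keys(rows, "user"))  # ['hot']
```

In a real job you would run the equivalent `GROUP BY key` count in Spark itself, but the diagnostic logic is the same.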

Scenario 2: Storage Bill Tripled

"Your S3 storage bill tripled in 3 months but data volume only grew 20%. What happened?"

Likely causes:

1. No compaction — streaming writes created millions of small files, each with per-file overhead.
2. No lifecycle policies — old versions and deleted files aren't being cleaned up (Delta Lake/Iceberg vacuum).
3. Duplicate materialized tables that nobody uses.
4. Uncleaned staging/temp data.

Fix: run VACUUM, set lifecycle policies, audit unused tables, enable auto-compaction.
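A back-of-envelope model (all numbers assumed, not from the scenario) shows how retained file versions alone can multiply storage with no real data growth:

```python
def stored_bytes(live_bytes, daily_rewrite_fraction, retention_days):
    """Live data plus tombstoned file versions retained until VACUUM runs.

    If 10% of a table is rewritten daily and old versions are kept for
    30 days, roughly 3 extra copies' worth of dead files sit alongside
    the live data.
    """
    return live_bytes * (1 + daily_rewrite_fraction * retention_days)

live = 10e12  # 10 TB of live data (hypothetical)
print(f"{stored_bytes(live, 0.10, 30) / live:.1f}x")  # 4.0x the live footprint
```

This is why a bill can triple while "data volume" — the live rows — barely moves.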

Scenario 3: Cluster at 15% Utilization

"Your always-on Spark cluster averages 15% CPU utilization but the team insists they need it 24/7. What do you do?"

1. Analyze usage patterns — when are the actual peak hours?
2. Enable auto-terminate after 15 minutes idle.
3. Move to an auto-scaling cluster with min/max bounds.
4. Schedule batch jobs in windows rather than ad hoc.
5. Consider serverless (Databricks Serverless, EMR Serverless) — pay only while jobs run.

Going from 15% to 70% utilization saves roughly 80% of compute cost.

More Scenarios

Scenario 4: The Rogue Analyst

"An analyst ran SELECT * FROM a 50 TB table in BigQuery (on-demand pricing). The query cost $250. How do you prevent this?"

1. Set maximumBytesBilled per query (e.g., a 1 TB limit).
2. Require partition filters on large tables.
3. Create authorized views that pre-filter data.
4. Set up project-level cost alerts.
5. Consider switching heavy users to flat-rate BigQuery slots.
6. Educate users: always use WHERE clauses and SELECT only the needed columns.
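The $250 figure matches BigQuery's classic $5/TB on-demand rate (actual pricing varies by region and edition — treat the rate below as an assumption for illustration):

```python
def on_demand_cost(bytes_scanned, dollars_per_tb=5.0):
    """On-demand query cost sketch: you pay per byte scanned,
    not per row returned — which is why SELECT * on 50 TB is $250
    even if the result set is tiny."""
    return bytes_scanned / 1e12 * dollars_per_tb

print(on_demand_cost(50e12))  # 250.0 — the full 50 TB table
print(on_demand_cost(5e12))   # 25.0  — same query after a partition filter
```

A maximumBytesBilled cap simply refuses to run any query whose scan estimate exceeds the limit, so the $250 mistake fails fast instead of billing.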

Scenario 5: Backfill Gone Wrong

"A backfill of 2 years of data was estimated at $3K but actually cost $18K. What went wrong?"

Common causes:

1. The estimate was extrapolated from a recent (small) day — older data may have very different volumes.
2. Data skew in the historical data caused stragglers.
3. No checkpointing — the job failed at month 8 and restarted from scratch.
4. The cluster auto-scaled to its max during peak contention.
5. Spot instances weren't used.

Fix: sample from multiple time periods, stage the backfill in batches with checkpoints, use spot, and set budget alerts.
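Cause 3 is often the biggest multiplier. A toy cost model (all numbers hypothetical) of a 24-month backfill that fails once at month 8:

```python
def backfill_cost(months, cost_per_month, fail_at_month=None, checkpointed=True):
    """Total cost of a backfill that fails once and is rerun.

    With checkpoints the rerun resumes where it stopped; without them,
    everything processed before the failure is paid for twice.
    """
    total = months * cost_per_month
    if fail_at_month is not None and not checkpointed:
        total += fail_at_month * cost_per_month  # redone work
    return total

print(backfill_cost(24, 125, fail_at_month=8, checkpointed=True))   # 3000
print(backfill_cost(24, 125, fail_at_month=8, checkpointed=False))  # 4000
```

Stack a couple of failures on top of a 3x volume misestimate and the jump from $3K to $18K stops looking mysterious.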

Scenario 6: Join Taking 4 Hours

"A join between a 200 GB table and a 500 GB table takes 4 hours. Both are well-partitioned. What do you check?"

1. Check for data skew on the join key — is one key responsible for 80% of the data?
2. Check the shuffle partition count (the default of 200 may be too few for 700 GB).
3. Are both tables sorted/bucketed on the join key? Bucketing eliminates the shuffle.
4. Can you filter either table before the join?
5. Enable AQE for automatic skew handling.
6. Check that columns are being pruned — SELECT only what's needed.
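For point 2, a common rule of thumb (an assumption, not a Spark default) is to size shuffle partitions at roughly 128-200 MB each:

```python
def shuffle_partitions(shuffle_bytes, target_bytes=128 * 1024**2):
    """Partitions needed to keep each shuffle partition near a target size.

    128 MiB per partition is a widely used rule of thumb; Spark's default
    of spark.sql.shuffle.partitions=200 ignores data volume entirely.
    """
    return max(1, -(-shuffle_bytes // target_bytes))  # ceiling division

print(shuffle_partitions(700 * 1024**3))  # 5600 — vs. the default 200
```

At the default 200, each partition of a 700 GB shuffle carries ~3.5 GB — enough to spill to disk or OOM, which is where the 4 hours go.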

Scenario 7: Streaming Micro-Batch Costs

"Your Kafka-to-Delta streaming job produces 50K files/day. Storage costs are growing 5x faster than expected."

1. Increase the trigger interval — from 30 s to 5-10 min if latency allows.
2. Enable Delta auto-compaction (autoOptimize.autoCompact).
3. Schedule OPTIMIZE to run hourly or daily.
4. Use optimizeWrite to coalesce partitions during the write.
5. Run VACUUM regularly to clean up old files.

Going from 50K files/day to ~200 files/day can cut storage overhead by 90%.
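The file count is roughly batches/day times files written per batch, so the trigger interval is the biggest lever. A sketch with assumed per-batch numbers:

```python
def files_per_day(trigger_seconds, files_per_batch):
    """Micro-batch file output: each batch writes at least one file per
    output partition, so shorter triggers multiply the file count."""
    return (86_400 // trigger_seconds) * files_per_batch

print(files_per_day(30, 17))   # 48960 — close to the scenario's 50K/day
print(files_per_day(600, 17))  # 2448  — a 10 min trigger, before compaction
```

Compaction (OPTIMIZE) then merges those thousands of batch files into a few large ones, which is how you reach the ~200 files/day figure.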

Common Gotchas

These traps catch even experienced engineers. Know them before your interview.

🚫 SELECT * in Production

Scans every column in columnar formats. A 50-column Parquet table scanned in full can cost roughly 10x more than selecting only the 5 columns you need (assuming columns of similar size).

🚫 Partition by High-Cardinality

Partitioning by user_id (10M users) creates 10M micro-partitions. Use Z-ordering or bucketing instead.

🚫 Forgetting to VACUUM

Delta Lake keeps old file versions for time travel. Without periodic VACUUM, storage can balloon within weeks as rewritten files pile up as retained history.

🚫 Wrong Shuffle Partition Count

200 partitions for 1 TB = 5 GB per partition (OOM risk). 200 partitions for 1 GB = 5 MB each (overhead waste). Tune it.

🚫 Always-On Clusters

A cluster running 24/7 at 15% utilization wastes 85% of spend. Auto-terminate + auto-scale saves 60-80%.

🚫 Full Refresh on Large Tables

Reprocessing 2 TB daily when only 5 GB is new. Incremental processing would be 400x cheaper.
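The 400x comes straight from the ratio of bytes processed:

```python
def full_vs_incremental(table_bytes, new_bytes):
    """How many times more data a daily full refresh processes compared
    to an incremental run that touches only the new slice."""
    return table_bytes / new_bytes

print(full_vs_incremental(2e12, 5e9))  # 400.0
```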

Rapid-Fire Quiz: Part 1

Q1: A table has 50 columns but your query uses 3. How much data should be scanned?

Q2: Why does a broadcast join save money?

Q3: How does Z-ordering reduce cost?

Q4: Your daily pipeline reprocesses a 5 TB table that grows 10 GB/day. Best cost optimization?

Q5: An analyst costs $250 with one BigQuery SELECT *. Best prevention?

Rapid-Fire Quiz: Part 2

Q6: What's the recommended first step for introducing FinOps culture?

Q7: In Spark UI, one task takes 45 min while 199 others take 30 sec each. What's the problem?

Q8: A nightly batch job runs 4 hours and can tolerate restarts. How to cut compute cost 70%?

Q9: Delta table storage doubled despite no new data. What's the fix?

Q10: Which FinOps metrics would you track first?

🎯 What's Next?

If you scored 8+, you're interview-ready on cost optimization. 5-7: review Modules 2-5. Under 5: restart from Module 1 and take notes on each callout box. Remember: interviewers want to hear structured thinking, not just the right answer.

💡 Final Interview Tip

When answering cost questions, always use this structure: 1) Identify the cost driver. 2) Explain why it's expensive. 3) Propose 2-3 concrete optimizations with trade-offs. 4) Quantify the expected savings if possible.