Data platforms can scale costs linearly — or worse — with data volume and usage. A Spark cluster that costs $500/day at 1 TB can cost $15K/day at 30 TB without proper optimization.
Cost optimization ensures the platform stays sustainable while meeting SLAs. It requires understanding storage layout, compute behavior, query patterns, and operational practices.
"Cost optimization isn't about being cheap — it's about maximizing business value per dollar. A senior data engineer must balance performance, reliability, and cost at every design decision."
Every data platform cost ultimately falls into one of these four buckets.
Compute: Cluster hours, warehouse credits, serverless slots. Usually the largest cost driver, 50–70% of total spend in most orgs.
Storage: Raw + derived + duplicate data. Grows silently. A single table with three copies across layers triples your storage bill.
Network: Cross-region transfers, cloud egress, inter-AZ traffic. Often overlooked until the bill arrives.
Operations: Backfills, retries, compaction, maintenance. A full historical backfill can cost more than six months of normal processing.
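To see why the operations bucket bites, a back-of-the-envelope backfill estimator helps. This is a minimal sketch with entirely hypothetical numbers; the function name and the 20% overhead factor are illustrative assumptions, not a real tool:

```python
def backfill_cost(daily_cost: float, history_days: int,
                  reprocess_factor: float = 1.0) -> float:
    """Rough cost of reprocessing `history_days` of history, assuming each
    historical day costs about as much as a normal daily run, scaled by
    `reprocess_factor` (retries, extra shuffle, etc.). Numbers hypothetical."""
    return daily_cost * history_days * reprocess_factor

# A pipeline costing $200/day, backfilled over two years with 20% overhead:
total = backfill_cost(200.0, 730, 1.2)   # $175,200
# Compare: six months of normal runs is roughly 200 * 182 = $36,400.
```

Even with generous assumptions, the two-year backfill costs several times what half a year of steady-state processing does, which is why backfills deserve their own budget line.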
Performance and cost are not opposites. Most optimizations improve both simultaneously.
The pattern repeats: less data moved = less time = less money. Partition pruning, column pruning, filter pushdown — they all reduce both runtime and cost.
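The "less data moved = less money" effect is easy to quantify under a bytes-scanned pricing model (BigQuery-style on-demand billing is one example). The sketch below is illustrative; the $5/TB price and the pruning fractions are assumptions, not a quote from any vendor:

```python
TB = 1024 ** 4

def scan_cost(total_bytes: int, price_per_tb: float,
              partition_fraction: float = 1.0,
              column_fraction: float = 1.0) -> float:
    """Estimated cost of a query that reads `partition_fraction` of a
    table's partitions and `column_fraction` of its columns, billed
    per TB scanned. All inputs are hypothetical examples."""
    scanned = total_bytes * partition_fraction * column_fraction
    return scanned / TB * price_per_tb

full = scan_cost(30 * TB, 5.0)                     # full 30 TB scan -> $150.00
pruned = scan_cost(30 * TB, 5.0, 1 / 365, 0.1)     # one day's partition, 10% of columns
```

Pruning to a single daily partition and a tenth of the columns cuts the scan (and the bill) by more than three orders of magnitude, and the query finishes faster for exactly the same reason.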
Sometimes you trade cost for speed (e.g., over-provisioned clusters for SLA). Or trade speed for cost (e.g., spot instances with possible preemption). The interview question is: how do you decide which trade-off to make?
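One way to make that decision concretely is an expected-cost comparison. The sketch below models spot instances with a single possible preemption; the discount, preemption probability, and retry overhead are made-up inputs, and real preemptions can of course happen more than once:

```python
def expected_spot_cost(on_demand_hourly: float, spot_discount: float,
                       preempt_prob: float, retry_overhead_hours: float,
                       job_hours: float) -> float:
    """Expected cost of a job on spot capacity, assuming at most one
    preemption that adds `retry_overhead_hours` of rework with
    probability `preempt_prob`. A simplification for illustration."""
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    expected_hours = job_hours + preempt_prob * retry_overhead_hours
    return spot_hourly * expected_hours

on_demand = 10.0 * 4.0                                  # $10/hr cluster, 4-hour job -> $40.00
spot = expected_spot_cost(10.0, 0.7, 0.3, 2.0, 4.0)     # -> $13.80
```

Here spot wins comfortably even after pricing in retries, so the trade-off is justified for a batch job. For an SLA-critical pipeline, you would instead weigh the cost of a missed deadline, which usually flips the answer.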
Know these cold — interviewers test whether you understand what you're actually paying for.
FinOps is not a tool — it's a culture shift. Engineers own their costs like they own their uptime.
Inform: Make costs visible. Dashboards, tagging, cost-per-pipeline metrics. You can't optimize what you can't see.
Optimize: Right-size clusters, use spot instances, partition data, prune columns. Continuous small wins compound.
Operate: Automated alerts, guardrails, budget thresholds. Embed cost checks into CI/CD and code reviews.
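A cost guardrail in CI can be as simple as comparing an estimated daily cost against a baseline and failing the check past a threshold. This is a hypothetical sketch; the function name, the 20% default, and the error handling are all illustrative, not a standard tool:

```python
def check_cost_impact(estimated_daily_cost: float,
                      baseline_daily_cost: float,
                      threshold_pct: float = 20.0) -> float:
    """CI guardrail sketch: raise when a change's estimated daily cost
    exceeds the baseline by more than `threshold_pct` percent.
    Names and thresholds are illustrative assumptions."""
    increase_pct = ((estimated_daily_cost - baseline_daily_cost)
                    / baseline_daily_cost * 100.0)
    if increase_pct > threshold_pct:
        raise RuntimeError(
            f"Cost guardrail: +{increase_pct:.1f}% daily cost "
            f"exceeds the {threshold_pct}% budget threshold"
        )
    return increase_pct

check_cost_impact(110.0, 100.0)   # +10% -> passes
```

Wired into a pull-request check, this turns the cost conversation from a quarterly surprise into a routine part of code review.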
"I believe every pull request that touches a pipeline should include a cost impact estimate — even a rough one. This normalizes cost thinking across the team."