📦 How Partitioning Saves Money

Good partitioning enables partition pruning — the query engine skips data it doesn't need. Bad partitioning creates thousands of tiny files that increase overhead.

Query: WHERE date = '2024-03-15' against a table partitioned by date (one year of daily partitions):
  → Scan 1 partition (the matching date)
  → Skip the other 364 partitions

Without partitioning, you'd scan the entire table — 365x more data, 365x more cost on per-byte billing.
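The 365x arithmetic can be sketched in a few lines of Python. The $5/TB-scanned rate is an illustrative assumption, not a quote from any vendor — actual per-byte pricing varies by engine:

```python
PRICE_PER_TB_SCANNED = 5.00  # assumed rate, USD per TB scanned

def scan_cost(partitions_scanned: int, partition_size_gb: float) -> float:
    """Cost of a query that scans the given number of partitions."""
    tb_scanned = partitions_scanned * partition_size_gb / 1024
    return tb_scanned * PRICE_PER_TB_SCANNED

# One year of daily partitions, ~2 GB each.
pruned = scan_cost(partitions_scanned=1, partition_size_gb=2.0)
full_scan = scan_cost(partitions_scanned=365, partition_size_gb=2.0)

print(f"pruned:    ${pruned:.4f}")
print(f"full scan: ${full_scan:.4f}")
print(f"ratio:     {full_scan / pruned:.0f}x")  # 365x
```

Whatever the per-TB rate, the ratio is fixed by the partition count: pruning to 1 of 365 partitions cuts scan cost by exactly 365x.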

💡 Interview Gold

"Partition on columns used in WHERE clauses. The ideal scheme keeps 100 MB–1 GB of data per partition and matches the most common query access patterns."

The Small Files Problem

Small files are the silent cost killer. Each file carries fixed overhead: metadata, open/close, scheduling a task. With 100,000 tiny files, engines spend more time planning than processing.

🚫 100K × 1 MB Files

  • 100K file open/close operations
  • Massive metadata overhead
  • 100K task scheduling events
  • Driver OOM risk from metadata
  • 20-50x slower than optimal

✅ 400 × 256 MB Files

  • 400 file operations
  • Minimal metadata footprint
  • 400 well-sized tasks
  • Healthy parallelism balance
  • Optimal read throughput

⚠️ Common Causes

Streaming micro-batches, overly granular partitions (e.g., partition by user_id with millions of users), and append-heavy workloads without compaction.
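The file-count comparison above is just division plus a fixed per-file cost. A minimal sketch — the 5 ms per file open/close is an assumed illustrative figure, not a measured constant:

```python
FIXED_OVERHEAD_MS_PER_FILE = 5  # assumed open/close + scheduling cost

def file_plan(total_gb: float, file_size_mb: float):
    """Return (file_count, total fixed overhead in seconds)."""
    n_files = -(-int(total_gb * 1024) // int(file_size_mb))  # ceiling division
    return n_files, n_files * FIXED_OVERHEAD_MS_PER_FILE / 1000

small = file_plan(total_gb=100, file_size_mb=1)    # 102,400 files, 512 s of overhead
right = file_plan(total_gb=100, file_size_mb=256)  # 400 files, 2 s of overhead
print(small, right)
```

Same 100 GB of data either way; only the per-file overhead changes, and it scales linearly with file count — which is why compaction pays for itself quickly.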

Target File Sizes

The sweet spot balances parallelism against overhead. Too small = scheduling waste. Too large = poor parallelism and slow selective queries.

🔴 < 32 MB

Too small. Overhead dominates. Compaction needed urgently.

🟡 32–128 MB

Acceptable for high-cardinality partitions. Monitor for growth.

🟢 128 MB–1 GB

The sweet spot. Most engines optimize for this range. Spark default target is ~128 MB.

🔵 > 1 GB

Works for sequential scans but hurts selective queries. Acceptable only if parallelism remains sufficient for your cluster size.
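The four bands above can be captured in a tiny helper (the verdict strings and boundaries simply restate the guidance above; names are illustrative):

```python
def file_size_verdict(size_mb: float) -> str:
    """Map an average file size to a recommendation band."""
    if size_mb < 32:
        return "too small: compact urgently"
    if size_mb < 128:
        return "acceptable: monitor"
    if size_mb <= 1024:
        return "sweet spot"
    return "large: check parallelism and selective-query latency"

print(file_size_verdict(12))   # too small: compact urgently
print(file_size_verdict(256))  # sweet spot
```

In practice you would feed this the average file size per partition (total partition bytes / file count) from table metadata, since averages per partition are what compaction jobs act on.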

-- Delta Lake auto-optimize example
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.targetFileSize' = '268435456'  -- 256 MB
);

Compaction & Clustering

Compaction merges small files into larger ones. Clustering (Z-ordering) organizes data within files for better data skipping.

1000 small files → OPTIMIZE (compact) → 10 large files → ZORDER BY col → co-located data

-- Delta Lake: compact + Z-order
OPTIMIZE events
ZORDER BY (user_id, event_type);

-- Iceberg: rewrite with sort order
CALL catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'user_id ASC, event_type ASC'
);

🎯 Z-Order vs Linear Sort

Z-ordering interleaves bits from multiple columns, so queries filtering on any combination of the Z-ordered columns benefit from data skipping. A linear (lexicographic) sort only significantly helps filters on the leading sort column.
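The bit interleaving can be shown with a minimal 2-D Morton-code sketch. This is a simplification for intuition — engines like Delta typically interleave over range-bucketed column values and file-level min/max statistics rather than raw integers:

```python
def morton_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

# Sorting rows by morton_2d(a, b) keeps rows that are close in EITHER
# dimension close together in the file, so min/max stats can skip files
# for filters on either column — unlike sorting by (a, b) alone.
rows = sorted(((x, y) for x in range(4) for y in range(4)),
              key=lambda p: morton_2d(*p))
print(rows)
```

Note the trade-off: each column added to the Z-order dilutes the skipping power for every other column, which is why Z-ordering is usually limited to a handful of high-selectivity columns.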

Storage Tiering

Not all data deserves the same storage class. Tiering moves cold data to cheaper storage automatically.

🔥 Hot

Last 7-30 days. Frequent access. Standard storage. Highest $/GB.

🌤️ Warm

30-90 days. Occasional access. Infrequent Access tier. ~40% savings.

❄️ Cold / Archive

90+ days. Rarely accessed. Glacier / Archive. ~80% savings. Retrieval delay.
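The tier savings above are straightforward to model. The prices here are illustrative assumptions in the ballpark of published S3 per-GB-month rates — check current pricing (and retrieval fees, which this sketch ignores) before relying on them:

```python
# Assumed $/GB-month rates: standard, Infrequent Access, archive.
PRICE = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}

def monthly_cost(gb_by_tier: dict) -> float:
    """Monthly storage cost for data spread across tiers."""
    return sum(PRICE[tier] * gb for tier, gb in gb_by_tier.items())

total_gb = 50 * 1024  # 50 TB
all_hot = monthly_cost({"hot": total_gb})
tiered = monthly_cost({"hot": 0.1 * total_gb,    # last 30 days
                       "warm": 0.2 * total_gb,   # 30-90 days
                       "cold": 0.7 * total_gb})  # 90+ days
print(f"all hot: ${all_hot:,.0f}/mo, tiered: ${tiered:,.0f}/mo")
```

Under these assumed rates and age distribution, tiering 50 TB cuts the storage bill by roughly two thirds; the real break-even also depends on how often cold data must be retrieved.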

💡 Interview Gold

"I set up S3 lifecycle policies to transition objects older than 90 days to Glacier. For a 50 TB dataset, this saved ~$800/month. But I kept Parquet metadata in hot storage so the catalog could still plan queries."

Quiz: Test Yourself

Q1: Why does partitioning reduce cost on per-byte billing?

Q2: What's the recommended target file size for Parquet?

Q3: Which scenario most commonly causes the small files problem?

Q4: Why use Z-ordering instead of a regular sort?

Q5: A table has 3 years of data but only last 30 days are queried. What's the best cost strategy?