Good partitioning enables partition pruning — the query engine skips data it doesn't need. Bad partitioning creates thousands of tiny files that increase overhead.
Query: WHERE date = '2024-03-15' → Scan 1 partition → Skip 364 partitions
Without partitioning, you'd scan the entire table — 365x more data, 365x more cost on per-byte billing.
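The pruning step above can be sketched in plain Python. This is an illustrative simulation (the bucket path and table layout are hypothetical, not from any specific engine): partitions are Hive-style `date=YYYY-MM-DD` directories, and the "engine" plans a scan by filtering the directory list before touching any data.

```python
from datetime import date, timedelta

def prune_partitions(partitions, predicate):
    """Keep only the partition paths whose date= value satisfies the predicate."""
    return [p for p in partitions if predicate(p.split("date=")[1])]

# A year of daily partitions: date=2024-01-01/ ... (365 directories)
start = date(2024, 1, 1)
partitions = [f"s3://lake/events/date={start + timedelta(days=i)}" for i in range(365)]

# WHERE date = '2024-03-15' → plan touches 1 directory, skips the other 364
to_scan = prune_partitions(partitions, lambda d: d == "2024-03-15")
print(len(to_scan), len(partitions) - len(to_scan))  # 1 364
```

The point is that pruning happens at planning time, against partition metadata only — no data file is opened for the 364 skipped days.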
💡 Interview Gold
"Partition on columns used in WHERE clauses. The ideal partition has enough data per partition (100 MB–1 GB) and matches the most common query access patterns."
The Small Files Problem
Small files are the silent cost killer. Each file carries fixed overhead: metadata, open/close, scheduling a task. With 100,000 tiny files, engines spend more time planning than processing.
🚫 100K × 1 MB Files
100K file open/close operations
Massive metadata overhead
100K task scheduling events
Driver OOM risk from metadata
20–50x slower than optimal
✅ 400 × 256 MB Files
400 file operations
Minimal metadata footprint
400 well-sized tasks
Healthy parallelism balance
Optimal read throughput
⚠️ Common Causes
Streaming micro-batches, overly granular partitions (e.g., partition by user_id with millions of users), and append-heavy workloads without compaction.
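The fix for append-heavy workloads is periodic compaction. As a rough sketch (a greedy bin-packing planner, not any engine's actual algorithm), here is how 100K one-megabyte files collapse into a few hundred well-sized rewrite groups:

```python
TARGET = 256 * 1024 * 1024  # 256 MB, per the comparison above

def plan_compaction(file_sizes, target=TARGET):
    """Greedily group small files so each output file is at most `target` bytes."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            groups.append(current)      # flush a full output group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 100,000 × 1 MB files → 391 output groups (close to the ~400 figure above)
sizes = [1024 * 1024] * 100_000
print(len(plan_compaction(sizes)))  # 391
```

Real table formats (Delta's `OPTIMIZE`, Iceberg's `rewrite_data_files`) do this for you; the value of the sketch is seeing that the planning problem itself is trivial — the cost was always in the per-file overhead.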
Target File Sizes
The sweet spot balances parallelism against overhead. Too small = scheduling waste. Too large = poor parallelism and slow selective queries.
🔴 < 32 MB
Too small. Overhead dominates. Compaction needed urgently.
🟡 32–128 MB
Acceptable for high-cardinality partitions. Monitor for growth.
🟢 128 MB–1 GB
The sweet spot. Most engines optimize for this range. Spark default target is ~128 MB.
🔵 > 1 GB
Works for sequential scans but hurts selective queries; acceptable only when parallelism is already sufficient.
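These bands are easy to encode as a monitoring check. A minimal helper (thresholds taken directly from the table above; the zone labels are illustrative):

```python
MB = 1024 * 1024

def file_size_zone(size_bytes):
    """Classify a data file against the target-size bands above."""
    if size_bytes < 32 * MB:
        return "too small: compact urgently"
    if size_bytes < 128 * MB:
        return "acceptable: monitor"
    if size_bytes <= 1024 * MB:
        return "sweet spot"
    return "large: check parallelism"

print(file_size_zone(200 * MB))  # sweet spot
print(file_size_zone(5 * MB))   # too small: compact urgently
```

Running this over a partition's file listing (e.g. from `aws s3 ls` output) gives a quick health report before queries start slowing down.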
Z-Ordering
Z-ordering interleaves bits from multiple columns so queries filtering on any combination of Z-ordered columns benefit from data skipping. A linear sort only meaningfully helps queries on the first sort column.
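The bit interleaving itself is the classic Morton code. A minimal two-column sketch (production engines handle more columns and map arbitrary types to integers first, which this omits):

```python
def interleave_bits(x, y, bits=16):
    """2-D Z-order (Morton) value: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x land at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land at odd positions
    return z

print(interleave_bits(1, 1))  # 3  (binary 11)
print(interleave_bits(3, 0))  # 5  (binary 0101)
print(interleave_bits(0, 3))  # 10 (binary 1010)
```

Sorting rows by this value clusters rows that are close on *either* column into the same files, so min/max file statistics stay tight for filters on x, on y, or on both.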
Storage Tiering
Not all data deserves the same storage class. Tiering moves cold data to cheaper storage automatically.
🔥 Hot
Last 7–30 days. Frequent access. Standard storage. Highest $/GB.
"I set up S3 lifecycle policies to transition objects older than 90 days to Glacier. For a 50 TB dataset, this saved ~$800/month. But I kept Parquet metadata in hot storage so the catalog could still plan queries."
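A lifecycle rule like the one described might look like this. This is a hedged sketch — the bucket prefix is hypothetical, and you would apply it with boto3's `put_bucket_lifecycle_configuration`; adapt the `Filter` so catalog/metadata prefixes are excluded, per the point about keeping metadata hot:

```python
# Lifecycle rule: move objects under a (hypothetical) data prefix to
# Glacier after 90 days. Metadata lives under a different prefix that
# this rule deliberately does not match, so it stays in hot storage.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-cold-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "warehouse/events/"},  # data files only
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

print(lifecycle_configuration["Rules"][0]["Transitions"][0]["Days"])  # 90
```

One design note: transition rules are prefix-scoped, which is another reason to keep data and metadata under clearly separated prefixes from day one.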
Quiz: Test Yourself
Q1: Why does partitioning reduce cost on per-byte billing?
Q2: What's the recommended target file size for Parquet?
Q3: Which scenario most commonly causes the small files problem?
Q4: Why use Z-ordering instead of a regular sort?
Q5: A table has 3 years of data but only last 30 days are queried. What's the best cost strategy?