Good partitioning enables partition pruning — the query engine skips data it doesn't need. Bad partitioning creates thousands of tiny files that increase overhead.
Query: WHERE date = '2024-03-15' → Scan 1 partition → Skip 364 partitions
Without partitioning, you'd scan the entire table — 365x more data, 365x more cost on per-byte billing.
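The pruning step above can be sketched in plain Python. This is an illustrative simulation (the bucket path and table layout are hypothetical, not from any specific engine): partitions are Hive-style `date=YYYY-MM-DD` directories, and the "engine" plans a scan by filtering the directory list before touching any data.

```python
from datetime import date, timedelta

def prune_partitions(partitions, predicate):
    """Keep only the partition paths whose date= value satisfies the predicate."""
    return [p for p in partitions if predicate(p.split("date=")[1])]

# A year of daily partitions: date=2024-01-01/ ... (365 directories)
start = date(2024, 1, 1)
partitions = [f"s3://lake/events/date={start + timedelta(days=i)}" for i in range(365)]

# WHERE date = '2024-03-15' → plan touches 1 directory, skips the other 364
to_scan = prune_partitions(partitions, lambda d: d == "2024-03-15")
print(len(to_scan), len(partitions) - len(to_scan))  # 1 364
```

The point is that pruning happens at planning time, against partition metadata only — no data file is opened for the 364 skipped days.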
💡 Interview Gold
"Partition on columns used in WHERE clauses. The ideal partition has enough data per partition (100 MB–1 GB) and matches the most common query access patterns."
The Small Files Problem
Small files are the silent cost killer. Each file carries fixed overhead: metadata, open/close, scheduling a task. With 100,000 tiny files, engines spend more time planning than processing.
🚫 100K × 1 MB Files
100K file open/close operations
Massive metadata overhead
100K task scheduling events
Driver OOM risk from metadata
20–50x slower than optimal
✅ 400 × 256 MB Files
400 file operations
Minimal metadata footprint
400 well-sized tasks
Healthy parallelism balance
Optimal read throughput
⚠️ Common Causes
Streaming micro-batches, overly granular partitions (e.g., partition by user_id with millions of users), and append-heavy workloads without compaction.
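The fix for append-heavy workloads is periodic compaction. As a rough sketch (a greedy bin-packing planner, not any engine's actual algorithm), here is how 100K one-megabyte files collapse into a few hundred well-sized rewrite groups:

```python
TARGET = 256 * 1024 * 1024  # 256 MB, per the comparison above

def plan_compaction(file_sizes, target=TARGET):
    """Greedily group small files so each output file is at most `target` bytes."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            groups.append(current)      # flush a full output group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 100,000 × 1 MB files → 391 output groups (close to the ~400 figure above)
sizes = [1024 * 1024] * 100_000
print(len(plan_compaction(sizes)))  # 391
```

Real table formats (Delta's `OPTIMIZE`, Iceberg's `rewrite_data_files`) do this for you; the value of the sketch is seeing that the planning problem itself is trivial — the cost was always in the per-file overhead.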
Target File Sizes
The sweet spot balances parallelism against overhead. Too small = scheduling waste. Too large = poor parallelism and slow selective queries.
🔴 < 32 MB
Too small. Overhead dominates. Compaction needed urgently.
🟡 32–128 MB
Acceptable for high-cardinality partitions. Monitor for growth.
🟢 128 MB–1 GB
The sweet spot. Most engines optimize for this range. Spark default target is ~128 MB.
🔵 > 1 GB
Works for sequential scans but hurts selective queries; acceptable only when parallelism is already sufficient.
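These bands are easy to encode as a monitoring check. A minimal helper (thresholds taken directly from the table above; the zone labels are illustrative):

```python
MB = 1024 * 1024

def file_size_zone(size_bytes):
    """Classify a data file against the target-size bands above."""
    if size_bytes < 32 * MB:
        return "too small: compact urgently"
    if size_bytes < 128 * MB:
        return "acceptable: monitor"
    if size_bytes <= 1024 * MB:
        return "sweet spot"
    return "large: check parallelism"

print(file_size_zone(200 * MB))  # sweet spot
print(file_size_zone(5 * MB))   # too small: compact urgently
```

Running this over a partition's file listing (e.g. from `aws s3 ls` output) gives a quick health report before queries start slowing down.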
Z-Ordering
Z-ordering interleaves bits from multiple columns so queries filtering on any combination of Z-ordered columns benefit from data skipping. A linear sort only meaningfully helps queries on the first sort column.
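The bit interleaving itself is the classic Morton code. A minimal two-column sketch (production engines handle more columns and map arbitrary types to integers first, which this omits):

```python
def interleave_bits(x, y, bits=16):
    """2-D Z-order (Morton) value: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x land at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land at odd positions
    return z

print(interleave_bits(1, 1))  # 3  (binary 11)
print(interleave_bits(3, 0))  # 5  (binary 0101)
print(interleave_bits(0, 3))  # 10 (binary 1010)
```

Sorting rows by this value clusters rows that are close on *either* column into the same files, so min/max file statistics stay tight for filters on x, on y, or on both.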
Storage Tiering
Not all data deserves the same storage class. Tiering moves cold data to cheaper storage automatically.
🔥 Hot
Last 7–30 days. Frequent access. Standard storage. Highest $/GB.
"I set up S3 lifecycle policies to transition objects older than 90 days to Glacier. For a 50 TB dataset, this saved ~$800/month. But I kept Parquet metadata in hot storage so the catalog could still plan queries."
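A lifecycle rule like the one described might look like this. This is a hedged sketch — the bucket prefix is hypothetical, and you would apply it with boto3's `put_bucket_lifecycle_configuration`; adapt the `Filter` so catalog/metadata prefixes are excluded, per the point about keeping metadata hot:

```python
# Lifecycle rule: move objects under a (hypothetical) data prefix to
# Glacier after 90 days. Metadata lives under a different prefix that
# this rule deliberately does not match, so it stays in hot storage.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-cold-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "warehouse/events/"},  # data files only
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

print(lifecycle_configuration["Rules"][0]["Transitions"][0]["Days"])  # 90
```

One design note: transition rules are prefix-scoped, which is another reason to keep data and metadata under clearly separated prefixes from day one.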
Quiz: Test Yourself
Q1: Why does partitioning reduce cost on per-byte billing?
Q2: What's the recommended target file size for Parquet?
Q3: Which scenario most commonly causes the small files problem?
Q4: Why use Z-ordering instead of a regular sort?
Q5: A table has 3 years of data but only last 30 days are queried. What's the best cost strategy?