Testing Data Transformations

Data testing is different from software testing. You're testing data correctness, not code behavior. Here are the four approaches.

Unit-Like Tests

Small fixture inputs → run transform → compare to expected output. Use golden datasets as ground truth.
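A minimal sketch of this pattern in Python. The transform `clean_orders` and the fixture rows are illustrative, not from a real codebase; the point is the shape: tiny fixture in, transform, exact comparison against a hand-maintained golden output.

```python
# Unit-like data test: run a transform over a tiny fixture and compare
# the result to a golden dataset checked into version control.

def clean_orders(rows):
    """Example transform: drop rows with missing user_id, uppercase status."""
    return [
        {**r, "status": r["status"].upper()}
        for r in rows
        if r["user_id"] is not None
    ]

fixture = [
    {"order_id": 1, "user_id": 10, "status": "paid"},
    {"order_id": 2, "user_id": None, "status": "paid"},  # should be dropped
]

golden = [
    {"order_id": 1, "user_id": 10, "status": "PAID"},
]

assert clean_orders(fixture) == golden
print("golden test passed")
```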

Property Tests

Assert invariants: revenue ≥ 0, row count monotonically increases, SUM(parts) = total. These catch edge cases you didn't think of.
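A property test asserts invariants over many generated inputs rather than one fixed output. The sketch below is self-contained (a real project might use a library like Hypothesis); `split_revenue` is a made-up transform used only to show the invariants from the text: parts are non-negative and sum to the total.

```python
# Property-test sketch: check invariants across random inputs.
import random

def split_revenue(total, n_parts):
    """Example transform: split an integer total into n non-negative parts."""
    cuts = sorted(random.randint(0, total) for _ in range(n_parts - 1))
    bounds = [0] + cuts + [total]
    return [b - a for a, b in zip(bounds, bounds[1:])]

for _ in range(100):
    total = random.randint(0, 10_000)
    parts = split_revenue(total, random.randint(1, 8))
    assert all(p >= 0 for p in parts)   # revenue >= 0
    assert sum(parts) == total          # SUM(parts) = total
print("invariants held for 100 random cases")
```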

Reconciliation Tests

Compare source vs target: row counts match, SUM(amount) matches, distinct key counts match. Catches data loss in transit.
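A reconciliation check can be as simple as computing the same aggregates on both sides and asserting equality. The sketch below uses an in-memory SQLite database standing in for source and target systems; table and column names are made up.

```python
# Reconciliation sketch: compare row counts, sums, and distinct key
# counts between source and target after a load.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src (id INTEGER, amount REAL);
    CREATE TABLE tgt (id INTEGER, amount REAL);
    INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO tgt SELECT * FROM src;  -- stands in for the pipeline
""")

def recon(table):
    return con.execute(
        f"SELECT COUNT(*), ROUND(SUM(amount), 2), COUNT(DISTINCT id) FROM {table}"
    ).fetchone()

assert recon("src") == recon("tgt"), "source/target drift detected"
print("reconciliation passed:", recon("src"))
```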

Regression Tests

After a code change, key metrics shouldn't shift unexpectedly. Use data diff tools like Datafold.
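Dedicated tools like Datafold diff data row by row; a cheap first line of defense is a metric-level diff, sketched below. The metric names and the 1% tolerance are illustrative assumptions.

```python
# Regression sketch: snapshot key metrics before a change, recompute
# after, and flag any metric whose relative change exceeds a tolerance.

def diff_metrics(before, after, tolerance=0.01):
    """Return metrics whose relative change exceeds the tolerance."""
    drifted = {}
    for name, old in before.items():
        new = after[name]
        if old and abs(new - old) / abs(old) > tolerance:
            drifted[name] = (old, new)
    return drifted

before = {"row_count": 1_000_000, "total_revenue": 52_340.10}
after  = {"row_count": 1_000_000, "total_revenue": 48_102.77}

print(diff_metrics(before, after))  # flags the total_revenue shift
```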

-- dbt test: property test for non-negative revenue
SELECT order_id, revenue
FROM {{ ref('fct_orders') }}
WHERE revenue < 0
-- Test passes if this returns 0 rows

Golden Datasets & Fixtures

Golden datasets are the backbone of data unit testing. They let you test transforms in isolation.

Fixture Input (10-50 hand-crafted rows) → Run Transform (dbt model / SQL / Python) → Compare Output (vs expected golden file)

Include Edge Cases

NULLs, empty strings, max-length values, negative numbers, timezone boundaries, Unicode characters.

Version Control Them

Store golden files in Git alongside your models. They're documentation AND tests. Review changes in PRs.

Keep Them Small

10-50 rows max. Enough to cover business logic paths without becoming a maintenance burden.

Interview Tip

"I test critical transforms with golden datasets stored in Git. Each fixture covers happy path, edge cases, and error scenarios. When a bug is found in prod, I add a fixture for that case — similar to adding a unit test for a code bug."

The Quarantine Pattern

Don't block the entire pipeline for a few bad rows. Split data into "good" and "quarantine" streams.

Raw Data (all incoming rows) → Validate (apply quality rules) → Good Data (→ production tables) or Bad Data (→ quarantine + error reason)
-- Quarantine pattern in SQL
INSERT INTO orders_clean
SELECT * FROM orders_raw
WHERE amount >= 0
  AND user_id IS NOT NULL
  AND status IN ('pending', 'paid', 'refunded');

INSERT INTO orders_quarantine
SELECT *, 'validation_failed' AS error_reason
FROM orders_raw
WHERE amount < 0
   OR user_id IS NULL
   OR status NOT IN ('pending', 'paid', 'refunded');
Critical Decision

For high-criticality tables (revenue, compliance), you may want to BLOCK publishing entirely when quarantine rates exceed a threshold (e.g., >5% bad rows). Keep raw ingestion running but hold the mart.
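The publish gate can be a one-line ratio check. A minimal sketch, assuming a `should_publish` helper you would call between validation and the mart build; the 5% threshold matches the example in the text.

```python
# Block publishing when the quarantine rate exceeds a threshold,
# while raw ingestion keeps running upstream.

def should_publish(good_rows, quarantined_rows, max_bad_ratio=0.05):
    """True if the batch is clean enough to promote to the mart."""
    total = good_rows + quarantined_rows
    return total > 0 and quarantined_rows / total <= max_bad_ratio

assert should_publish(good_rows=980, quarantined_rows=20)       # 2% bad: publish
assert not should_publish(good_rows=900, quarantined_rows=100)  # 10% bad: hold
print("publish gate behaves as expected")
```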

Validation vs Reconciliation

Interviewers test whether you know the difference: validation and reconciliation solve different problems.

Validation

  • 🔍 Checks if data satisfies rules
  • 🔍 Single dataset focus
  • 🔍 Types, constraints, ranges
  • 🔍 "Is this data valid?"
  • 🔍 Runs on the target data

Reconciliation

  • 🔗 Compares two datasets
  • 🔗 Source vs target focus
  • 🔗 Row counts, sums, key sets
  • 🔗 "Did we lose or gain data?"
  • 🔗 Runs across systems
Key Insight

Validation tells you the data is well-formed. Reconciliation tells you it's complete and consistent with the source. A table can pass validation (all rows are valid) but fail reconciliation (50% of source rows are missing).
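A tiny demonstration of that failure mode, using an in-memory SQLite database with made-up table names: every row in the target is valid, yet half the source never arrived.

```python
# Validation passes, reconciliation fails: valid but incomplete data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src (id INTEGER, amount REAL);
    CREATE TABLE tgt (id INTEGER, amount REAL);
    INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0);
    INSERT INTO tgt VALUES (1, 10.0), (2, 20.0);  -- lost half in transit
""")

# Validation: every target row satisfies the rules.
bad = con.execute(
    "SELECT COUNT(*) FROM tgt WHERE amount < 0 OR id IS NULL"
).fetchone()[0]
assert bad == 0  # validation passes

# Reconciliation: row counts do not match.
n_src = con.execute("SELECT COUNT(*) FROM src").fetchone()[0]
n_tgt = con.execute("SELECT COUNT(*) FROM tgt").fetchone()[0]
print(f"validation ok, reconciliation: {n_tgt}/{n_src} rows arrived")
```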

Idempotency for Retry-Safe Pipelines

Pipelines fail and retry. If a retry produces different results, you have a quality problem.

What Is Idempotency?

Running the same pipeline twice on the same input produces the same output. No duplicates, no missing data, no side effects.

MERGE Over INSERT

Use MERGE/UPSERT instead of INSERT for incremental loads. INSERT + retry = duplicates. MERGE + retry = same result.
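A runnable sketch of the idea, using SQLite's upsert syntax (`INSERT ... ON CONFLICT DO UPDATE`) as the local stand-in for MERGE; table and column names are illustrative. Running the load twice leaves one row per key, where a plain INSERT would duplicate on retry.

```python
# Idempotent incremental load via upsert on the natural key.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

def load(batch):
    con.executemany(
        """INSERT INTO orders (order_id, amount) VALUES (?, ?)
           ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount""",
        batch,
    )

batch = [(1, 10.0), (2, 20.0)]
load(batch)
load(batch)  # retry: same result, no duplicates
print(con.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 2
```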

Delete-Then-Insert

Alternative pattern: delete the partition/batch, then insert. Ensures clean state on retry. Wrap in a transaction.
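The delete-then-insert pattern can be sketched as below; the partition column (`day`) and table name are made up. The `with con:` block makes the delete and insert commit together or not at all, so a retry always converges to the same state.

```python
# Delete-then-insert: rewrite one partition atomically.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, amount REAL)")

def reload_partition(day, rows):
    with con:  # one transaction for delete + insert
        con.execute("DELETE FROM sales WHERE day = ?", (day,))
        con.executemany("INSERT INTO sales VALUES (?, ?)",
                        [(day, a) for a in rows])

reload_partition("2024-01-01", [10.0, 20.0])
reload_partition("2024-01-01", [10.0, 20.0])  # retry: identical state
print(con.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # -> 2
```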

Idempotency Keys

Tag each batch with a unique run_id. On retry, detect and skip already-processed batches.
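A sketch of the run_id bookkeeping, with made-up table names: record each batch's run_id on success, and short-circuit any batch seen before.

```python
# Idempotency keys: skip batches whose run_id was already processed.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE processed_batches (run_id TEXT PRIMARY KEY);
    CREATE TABLE events (run_id TEXT, payload TEXT);
""")

def process(run_id, rows):
    seen = con.execute(
        "SELECT 1 FROM processed_batches WHERE run_id = ?", (run_id,)
    ).fetchone()
    if seen:
        return "skipped"
    with con:  # write data and mark the batch in one transaction
        con.executemany("INSERT INTO events VALUES (?, ?)",
                        [(run_id, r) for r in rows])
        con.execute("INSERT INTO processed_batches VALUES (?)", (run_id,))
    return "processed"

print(process("run-42", ["a", "b"]))  # -> processed
print(process("run-42", ["a", "b"]))  # retry -> skipped
print(con.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # -> 2
```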

Interview Gold

"I design every pipeline to be idempotent. For incremental loads, I use MERGE with a natural key or delete-then-insert within a transaction. This means any retry produces the exact same result — no duplicates, no data loss."

Quiz: Test Yourself

Q1: Your warehouse has fewer rows than the source system. Which testing approach detects this?

Q2: Your pipeline uses INSERT INTO for incremental loads. It fails and retries. What happens?

Q3: What is the key benefit of the quarantine pattern over simply failing the pipeline?

Q4: You want to ensure revenue is never negative across all historical data, not just test fixtures. Which approach?

Q5: Why must delete-then-insert be wrapped in a transaction?