Data testing is different from software testing. You're testing data correctness, not code behavior. Here are the four approaches.
Small fixture inputs → run transform → compare to expected output. Use golden datasets as ground truth.
Assert invariants: revenue ≥ 0, row count monotonically increases, SUM(parts) = total. These catch edge cases you didn't think of.
Compare source vs target: row counts match, SUM(amount) matches, distinct key counts match. Catches data loss in transit.
After a code change, key metrics shouldn't shift unexpectedly. Use data diff tools like Datafold.
Golden datasets are the backbone of data unit testing. They let you test transforms in isolation.
Cover the edge cases in your fixtures: NULLs, empty strings, max-length values, negative numbers, timezone boundaries, Unicode characters.
Store golden files in Git alongside your models. They're documentation AND tests. Review changes in PRs.
10-50 rows max. Enough to cover business logic paths without becoming a maintenance burden.
"I test critical transforms with golden datasets stored in Git. Each fixture covers happy path, edge cases, and error scenarios. When a bug is found in prod, I add a fixture for that case — similar to adding a unit test for a code bug."
Don't block the entire pipeline for a few bad rows. Split data into "good" and "quarantine" streams.
For high-criticality tables (revenue, compliance), you may want to BLOCK publishing entirely when quarantine rates exceed a threshold (e.g., >5% bad rows). Keep raw ingestion running but hold the mart.
Interviewers test whether you know the difference between validation and reconciliation; the two solve different problems.
Validation tells you the data is well-formed. Reconciliation tells you it's complete and consistent with the source. A table can pass validation (all rows are valid) but fail reconciliation (50% of source rows are missing).
Pipelines fail and retry. If a retry produces different results, you have a quality problem.
Running the same pipeline twice on the same input produces the same output. No duplicates, no missing data, no side effects.
Use MERGE/UPSERT instead of INSERT for incremental loads. INSERT + retry = duplicates. MERGE + retry = same result.
Alternative pattern: delete the partition/batch, then insert. Ensures clean state on retry. Wrap in a transaction.
Tag each batch with a unique run_id. On retry, detect and skip already-processed batches.
"I design every pipeline to be idempotent. For incremental loads, I use MERGE with a natural key or delete-then-insert within a transaction. This means any retry produces the exact same result — no duplicates, no data loss."