Data testing is different from software testing. You're testing data correctness, not code behavior. Here are the four approaches.
Small fixture inputs → run transform → compare to expected output. Use golden datasets as ground truth.
Assert invariants: revenue ≥ 0, row count monotonically increases, SUM(parts) = total. These catch edge cases you didn't think of.
Compare source vs target: row counts match, SUM(amount) matches, distinct key counts match. Catches data loss in transit.
After a code change, key metrics shouldn't shift unexpectedly. Use data diff tools like Datafold.
Golden datasets are the backbone of data unit testing. They let you test transforms in isolation.
Cover the edge cases in your fixtures: NULLs, empty strings, max-length values, negative numbers, timezone boundaries, Unicode characters.
Store golden files in Git alongside your models. They're documentation AND tests. Review changes in PRs.
10-50 rows max. Enough to cover business logic paths without becoming a maintenance burden.
"I test critical transforms with golden datasets stored in Git. Each fixture covers happy path, edge cases, and error scenarios. When a bug is found in prod, I add a fixture for that case — similar to adding a unit test for a code bug."
Don't block the entire pipeline for a few bad rows. Split data into "good" and "quarantine" streams.
For high-criticality tables (revenue, compliance), you may want to BLOCK publishing entirely when quarantine rates exceed a threshold (e.g., >5% bad rows). Keep raw ingestion running but hold the mart.
Interviewers test whether you know the difference between validation and reconciliation; the two solve different problems.
Validation tells you the data is well-formed. Reconciliation tells you it's complete and consistent with the source. A table can pass validation (all rows are valid) but fail reconciliation (50% of source rows are missing).
Pipelines fail and retry. If a retry produces different results, you have a quality problem.
Running the same pipeline twice on the same input produces the same output. No duplicates, no missing data, no side effects.
Use MERGE/UPSERT instead of INSERT for incremental loads. INSERT + retry = duplicates. MERGE + retry = same result.
Alternative pattern: delete the partition/batch, then insert. Ensures clean state on retry. Wrap in a transaction.
Tag each batch with a unique run_id. On retry, detect and skip already-processed batches.
"I design every pipeline to be idempotent. For incremental loads, I use MERGE with a natural key or delete-then-insert within a transaction. This means any retry produces the exact same result — no duplicates, no data loss."