Every data quality interview starts here. Memorize these six dimensions: completeness, validity, accuracy, consistency, timeliness, and uniqueness. They're the foundation for everything else.
Completeness: Are all expected rows and fields present? Missing data = missing insights. Check: NULL rates, row counts vs. expected.
Validity: Do values fall within allowed ranges/domains? An email without "@" or an age of -5 is a validity failure.
Accuracy: Does the data match real-world truth? Revenue in the warehouse should match the source system. Hardest to verify.
Consistency: No contradictions across tables. If orders says 100 items and inventory says 50 shipped, something's wrong.
Timeliness: Data arrives within its SLA. A "real-time" dashboard showing yesterday's data is a timeliness failure.
Uniqueness: No duplicates where keys should be unique. Duplicates inflate metrics and break joins.
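Three of these dimensions (completeness, validity, uniqueness) are mechanical enough to sketch directly. A minimal illustration on toy records; the field names and rules are assumptions, not a real schema:

```python
from collections import Counter

# Toy records; field names and rules are illustrative assumptions.
records = [
    {"id": 1, "email": "a@x.com", "age": 34},
    {"id": 2, "email": "bad-email", "age": -5},  # validity failure
    {"id": 2, "email": "c@x.com", "age": 28},    # uniqueness failure (duplicate key)
    {"id": 3, "email": None, "age": 41},         # completeness failure
]

# Completeness: NULL rate on a required field
null_rate = sum(r["email"] is None for r in records) / len(records)

# Validity: values must fall in allowed domains/ranges
invalid_ids = [r["id"] for r in records
               if r["email"] is None or "@" not in r["email"] or r["age"] < 0]

# Uniqueness: no duplicate keys
dup_keys = [k for k, n in Counter(r["id"] for r in records).items() if n > 1]
```

Accuracy, consistency, and timeliness need external context (a source system, a second table, an SLA clock), which is why they're harder to automate.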
Interviewers love the row-level vs. aggregate distinction, because the two catch different failure modes.
"Row-level catches individual bad records. Aggregate catches systemic issues, like an entire partition missing. You need both because a table can pass every row-level check yet still be missing 90% of its data."
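A minimal sketch of that failure mode, assuming a toy partition and an invented volume baseline: every loaded row passes its row-level check, yet the aggregate check flags that 90% of the data never arrived.

```python
# Only 10 of an expected 100 rows arrived; names/baseline are assumptions.
loaded = [{"id": i, "amount": 10.0} for i in range(10)]

# Row-level check: each record looks fine on its own
row_level_ok = all(r["id"] is not None and r["amount"] > 0 for r in loaded)

# Aggregate check: compare volume against an expected baseline
expected_count = 100  # e.g. a trailing-average of recent loads
aggregate_ok = len(loaded) >= 0.5 * expected_count
```

Here `row_level_ok` is true while `aggregate_ok` is false, which is exactly the gap the quote describes.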
Most production pipelines are incremental. Your checks must handle partial loads.
Volume: new-batch row count within the expected range. A batch of 0 rows or 10x the normal volume signals a problem.
Uniqueness: validate unique keys within the new batch AND across the full table (detect retry duplicates).
Watermark: the max event_time in the new batch should be ≥ the previous batch's max. Going backward signals reprocessing issues.
Post-merge sanity: after the merge, total row count and key distributions should still make sense. Don't just check the delta.
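The per-batch checks above can be sketched as one validation function. This is a minimal in-memory sketch; the function name, field names (`id`, `event_time`), and error strings are assumptions:

```python
def validate_increment(batch, table, expected_rows, prev_max_event_time):
    """Checks for an incremental load; names and thresholds are illustrative."""
    errors = []
    lo, hi = expected_rows
    # Volume: 0 rows or far outside the normal range signals a problem
    if not (lo <= len(batch) <= hi):
        errors.append("row count out of range")
    # Uniqueness within the batch AND against the existing table
    keys = [r["id"] for r in batch]
    if len(keys) != len(set(keys)):
        errors.append("duplicate keys within batch")
    if {r["id"] for r in table} & set(keys):
        errors.append("keys already in table (retry duplicate?)")
    # Watermark: max event_time must not move backward
    if batch and max(r["event_time"] for r in batch) < prev_max_event_time:
        errors.append("event_time went backward")
    return errors
```

A replayed batch then fails three ways at once, e.g. `validate_increment([{"id": 1, "event_time": 90}], [{"id": 1, "event_time": 100}], (1, 1000), 100)` reports a retry duplicate and a backward watermark.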
Events don't always arrive in order. Mobile apps batch uploads, APIs retry, timezone gaps cause late data.
Set a policy: "we accept events up to 7 days late." Beyond that, they go to a dead-letter queue for manual review.
Completeness checks should compare event_time, not ingestion_time. A partition might look "complete" by load time but still be missing events.
Re-compute historical partitions when late data arrives. Use "close after N days" semantics for finalized reports.
Late-arriving data is a consistency vs timeliness tradeoff. Real-time dashboards show best-effort numbers. Daily batch reports wait for the completeness window to close before publishing.
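The "accept up to 7 days late" policy and the "close after N days" semantics can be sketched together. A minimal sketch; the function names and the 7-day window are illustrative assumptions:

```python
from datetime import date, timedelta

LATE_WINDOW = timedelta(days=7)  # policy: accept events up to 7 days late

def route_event(event_date: date, today: date) -> str:
    """Events beyond the window go to a dead-letter queue for manual review."""
    return "dead_letter" if today - event_date > LATE_WINDOW else "accept"

def partition_closed(partition_date: date, today: date) -> bool:
    """'Close after N days': a daily partition is final once its completeness
    window has passed; finalized reports read only closed partitions."""
    return today - partition_date > LATE_WINDOW
```

A real-time dashboard would read open partitions (best-effort numbers), while a daily batch report publishes a partition only once `partition_closed` returns true.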
An interviewer asks: "You just deployed a new table. What checks do you add day one?"
A solid baseline: (1) row count > 0, (2) NOT NULL on critical columns, (3) unique primary key, (4) freshness within SLA, (5) expected columns present. "These five checks catch 80% of production issues. Start here, then add domain-specific validations (revenue ranges, status enums) as you learn the data."
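The day-one baseline can be sketched as a single function over in-memory rows. Column names (`updated_at`), time units, and thresholds are illustrative assumptions to tune per table:

```python
def day_one_checks(rows, key, critical_cols, now, max_lag):
    """Baseline checks for a newly deployed table; names are illustrative."""
    if not rows:
        return ["table is empty"]          # 1. row count > 0
    failures = []
    for col in critical_cols:              # 2. NOT NULL on critical columns
        if any(r.get(col) is None for r in rows):
            failures.append(f"NULLs in {col}")
    keys = [r[key] for r in rows]          # 3. unique primary key
    if len(keys) != len(set(keys)):
        failures.append("duplicate keys")
    latest = max(r["updated_at"] for r in rows)
    if now - latest > max_lag:             # 4. freshness within SLA
        failures.append("stale data")
    expected = set(critical_cols) | {key, "updated_at"}
    if not expected <= set(rows[0]):       # 5. expected columns present
        failures.append("missing columns")
    return failures
```

In production these assertions would typically run as SQL or a data-quality framework rather than Python over rows, but the logic is the same.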