The 6 Dimensions of Data Quality

Every data quality interview starts here. Memorize these six — they're the foundation for everything else.

Completeness

Are all expected rows and fields present? Missing data = missing insights. Check: NULL rates, row counts vs expected.

Validity

Do values fall within allowed ranges/domains? An email without "@" or an age of -5 is a validity failure.

Accuracy

Does the data match real-world truth? Revenue in the warehouse should match the source system. Hardest to verify.

Consistency

No contradictions across tables. If the orders table says 100 items were ordered and inventory says only 50 shipped, something's wrong.

Timeliness

Data arrives within its SLA. A "real-time" dashboard showing yesterday's data is a timeliness failure.

Uniqueness

No duplicates where keys should be unique. Duplicates inflate metrics and break joins.
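Each dimension maps to a measurable metric. A minimal sketch over an in-memory list of records (the field names and values are invented for illustration):

```python
# Hypothetical user records; None models a missing (NULL) value.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "bad-email", "age": 27},       # validity failure
    {"id": 2, "email": "c@example.com", "age": None},  # duplicate id, NULL age
]

# Completeness: NULL rate per field.
null_rate = sum(r["age"] is None for r in rows) / len(rows)

# Validity: share of emails containing "@".
valid_emails = sum("@" in r["email"] for r in rows) / len(rows)

# Uniqueness: duplicate count on the key column.
ids = [r["id"] for r in rows]
duplicates = len(ids) - len(set(ids))

print(null_rate, valid_emails, duplicates)
```

Accuracy and consistency don't reduce to a single-table metric like this, which is why they're harder to automate.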

Row-Level vs Aggregate Checks

Interviewers love this distinction: the two kinds of checks catch different failure modes.

Row-Level Checks

  • ✅ NOT NULL on critical columns
  • ✅ Regex: email format, phone pattern
  • ✅ Range: 0 ≤ age ≤ 150
  • ✅ Enum: status IN ('active','inactive')
  • ✅ Foreign key: user_id exists in users
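The bullets above can be sketched as a per-record validator. This is a minimal illustration, not a framework: the schema, regex, and allowed sets are assumptions, and the foreign-key check uses an in-memory stand-in for the users table.

```python
import re

STATUSES = {"active", "inactive"}
KNOWN_USER_IDS = {101, 102}  # stand-in for a lookup against the users table

def row_errors(row):
    """Return the list of row-level check failures for one record."""
    errors = []
    if row.get("email") is None:
        errors.append("email NOT NULL")
    elif not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", row["email"]):
        errors.append("email format")
    if not (0 <= row.get("age", -1) <= 150):
        errors.append("age range")
    if row.get("status") not in STATUSES:
        errors.append("status enum")
    if row.get("user_id") not in KNOWN_USER_IDS:
        errors.append("foreign key")
    return errors

good = {"email": "a@b.co", "age": 30, "status": "active", "user_id": 101}
bad = {"email": "nope", "age": -5, "status": "deleted", "user_id": 999}
print(row_errors(good))  # []
print(row_errors(bad))   # all four checks fail
```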

Aggregate Checks

  • 📊 Row count within expected range
  • 📊 Distinct count of keys stable
  • 📊 SUM(revenue) within ±10% of yesterday
  • 📊 NULL rate below threshold per column
  • 📊 No missing date partitions
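The same ideas at the aggregate level look roughly like this (thresholds and yesterday's baseline are invented for illustration):

```python
# Yesterday's baseline vs today's batch (hypothetical numbers).
yesterday_revenue = 10_000.0
today_rows = [{"order_id": i, "revenue": 95.0} for i in range(100)]

row_count = len(today_rows)
revenue = sum(r["revenue"] for r in today_rows)

checks = {
    "row_count_in_range": 50 <= row_count <= 200,
    "revenue_within_10pct": abs(revenue - yesterday_revenue)
                            <= 0.10 * yesterday_revenue,
    "keys_distinct": len({r["order_id"] for r in today_rows}) == row_count,
}
failed = [name for name, ok in checks.items() if not ok]
print(failed)  # empty list means all aggregate checks pass
```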

Interview Gold

"Row-level catches individual bad records. Aggregate catches systemic issues — like an entire partition missing. You need both because a table can pass every row-level check yet still be missing 90% of its data."

Checks for Incremental Pipelines

Most production pipelines are incremental. Your checks must handle partial loads.

Batch Size Bounds

New batch row count within expected range. A batch of 0 rows or 10x normal signals a problem.
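One simple way to define "expected range" is the document's later ±2 standard deviations heuristic; a sketch, assuming you keep a short history of recent batch sizes:

```python
import statistics

def batch_size_ok(batch_rows, history, sigma=2.0):
    """Flag a batch whose row count drifts beyond ±sigma stdevs of history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(batch_rows - mean) <= sigma * stdev

history = [1000, 1100, 950, 1050, 980, 1020, 990]
print(batch_size_ok(1030, history))  # within bounds
print(batch_size_ok(0, history))     # empty batch -> fails
```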

Key Uniqueness on Batch

Validate unique keys within the new batch AND across the full table (detect retry duplicates).

Monotonic Watermarks

The max event_time in the new batch should be ≥ the previous batch's max. Going backward signals reprocessing issues.
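The watermark check itself is a one-line comparison; ISO-8601 timestamp strings compare correctly as strings, which keeps the sketch dependency-free:

```python
def watermark_ok(batch_max_event_time, previous_max_event_time):
    """A new batch's high-water mark should never move backward."""
    return batch_max_event_time >= previous_max_event_time

print(watermark_ok("2024-06-02T00:00:00", "2024-06-01T23:59:00"))  # True
print(watermark_ok("2024-05-30T00:00:00", "2024-06-01T23:59:00"))  # False
```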

Full-Table Invariants

After merge, total row count and key distributions should still make sense. Don't just check the delta.

-- dbt incremental check: detect retry duplicates
SELECT order_id, COUNT(*) AS cnt
FROM {{ this }}
WHERE _loaded_at >= '{{ var("run_start") }}'
GROUP BY order_id
HAVING COUNT(*) > 1

Handling Late-Arriving Data

Events don't always arrive in order. Mobile apps batch uploads, APIs retry failed requests, and timezone boundaries all produce late data.

Define Allowed Lateness

Set a policy: "we accept events up to 7 days late." Beyond that, they go to a dead-letter queue for manual review.
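That policy reduces to a simple routing rule; a sketch using the 7-day window from the text (the function name is invented):

```python
from datetime import date, timedelta

ALLOWED_LATENESS = timedelta(days=7)  # policy: accept events up to 7 days late

def route(event_date, today):
    """Accept events inside the lateness window; dead-letter the rest."""
    return "accept" if today - event_date <= ALLOWED_LATENESS else "dead_letter"

today = date(2024, 6, 10)
print(route(date(2024, 6, 5), today))  # accept
print(route(date(2024, 5, 1), today))  # dead_letter
```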

Track by Event Time

Completeness checks should compare on event_time, not ingestion_time. A partition might look "complete" by load time but still be missing events that haven't arrived yet.
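To see the difference, count the same events both ways; the dates are invented for illustration:

```python
from collections import Counter
from datetime import date

# Each event carries when it happened (event_date) and when we loaded it.
events = [
    {"event_date": date(2024, 6, 1), "loaded_date": date(2024, 6, 1)},
    {"event_date": date(2024, 6, 1), "loaded_date": date(2024, 6, 3)},  # 2 days late
    {"event_date": date(2024, 6, 2), "loaded_date": date(2024, 6, 2)},
]

by_load = Counter(e["loaded_date"] for e in events)
by_event = Counter(e["event_date"] for e in events)

# By load time June 1 looks like one event; by event time it actually has two.
print(by_load[date(2024, 6, 1)], by_event[date(2024, 6, 1)])
```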

Backfill Windows

Re-compute historical partitions when late data arrives. Use "close after N days" semantics for finalized reports.

Key Insight

Late-arriving data is a completeness vs timeliness tradeoff. Real-time dashboards show best-effort numbers. Daily batch reports wait for the completeness window to close before publishing.

Minimum Checks for Every Table

An interviewer asks: "You just deployed a new table. What checks do you add day one?"

1. Freshness: MAX(updated_at) within SLA. If the table should update daily, alert when it's >26 hours stale.
2. Volume: Row count within ±2 standard deviations of the trailing 7-day average.
3. Primary Key Uniqueness: No duplicate keys. This catches retry bugs and bad merges.
4. NOT NULL on Critical Columns: Business keys, timestamps, and foreign keys must never be NULL.
5. Referential Integrity: Every foreign key maps to a valid parent record (where applicable).

Interview Tip

"These five checks catch 80% of production issues. Start here, then add domain-specific validations (revenue ranges, status enums) as you learn the data."
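The starter five can be run as plain SQL; a minimal sketch against an in-memory SQLite table (the table, column names, and sample rows are invented for illustration, and the volume check uses a crude bound instead of a trailing stddev window):

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical orders/users tables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, updated_at TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, "2024-06-10 09:00"), (2, 102, "2024-06-10 10:00")],
)
con.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO users VALUES (?)", [(101,), (102,)])
now = datetime(2024, 6, 10, 12, 0)

# 1. Freshness: newest row within the 26-hour SLA.
max_updated = con.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
fresh = now - datetime.strptime(max_updated, "%Y-%m-%d %H:%M") < timedelta(hours=26)

# 2. Volume: crude lower bound stands in for the ±2-stddev window.
volume_ok = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0] >= 1

# 3. Primary key uniqueness.
dupes = con.execute(
    "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
    "GROUP BY order_id HAVING COUNT(*) > 1)"
).fetchone()[0]

# 4. NOT NULL on critical columns.
nulls = con.execute(
    "SELECT COUNT(*) FROM orders WHERE order_id IS NULL "
    "OR user_id IS NULL OR updated_at IS NULL"
).fetchone()[0]

# 5. Referential integrity: every order's user_id exists in users.
orphans = con.execute(
    "SELECT COUNT(*) FROM orders o LEFT JOIN users u "
    "ON o.user_id = u.user_id WHERE u.user_id IS NULL"
).fetchone()[0]

print(fresh, volume_ok, dupes, nulls, orphans)
```

Each query alerts when its value crosses the threshold (False for the booleans, nonzero for the counts).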

Quiz: Test Yourself

Q1: A table passes all NOT NULL and regex checks but is missing yesterday's partition entirely. Which check catches this?

Q2: Which quality dimension is hardest to measure automatically?

Q3: Your incremental pipeline retried after a timeout. What's the most likely data quality issue?

Q4: How should you track completeness for a dataset with late-arriving events?

Q5: Which is NOT one of the "starter five" checks for every new table?