Data quality means data is fit for its intended use: analysts can trust metrics, ML models get clean features, and dashboards tell the truth.
It's both a technical problem (pipelines, checks, monitoring) and a product problem (definitions, contracts, ownership).
"Data quality is fitness for purpose — data that is correct, complete, consistent, timely, and well-governed so every downstream consumer can make decisions with confidence."
In interviews, emphasize that quality isn't a single tool. It's a defense-in-depth strategy spanning the entire data lifecycle.
Interviewers love asking "why does data quality matter?" Answer with concrete business impact.
| Failure | Business Impact |
|---|---|
| Duplicate customer records | Over-reporting revenue, sending duplicate emails, compliance violations (GDPR) |
| Stale dashboard data | Executives make decisions on yesterday's numbers — launch a promo that already expired |
| Wrong join keys | Row explosion: 1M rows become 50M, breaking downstream models and reports |
| Timezone bugs | Daily aggregates shift by hours — "Monday" revenue includes Sunday night |
| Silent schema change | Column renamed upstream → NULLs propagate silently for days before anyone notices |
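Several of these failures can be caught with cheap assertions at join time. A minimal sketch (pure Python, function and threshold names hypothetical) of a guard against the row-explosion case:

```python
def safe_join(left, right, key, max_fanout=1.5):
    """Join two lists of dicts on `key`, failing fast if the output
    grows beyond max_fanout x the left side's row count — the classic
    symptom of duplicate keys on the right side."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = [{**l, **r} for l in left for r in index.get(l[key], [])]
    if joined and len(joined) > max_fanout * len(left):
        raise ValueError(
            f"row explosion: {len(left)} rows became {len(joined)}"
        )
    return joined
```

In SQL pipelines the same idea shows up as a post-join row-count assertion, or as pandas' `merge(..., validate="one_to_many")`, which raises if the key is not actually unique.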
A fintech company shipped a pricing model trained on data with duplicate rows. The model over-weighted certain segments. Result: $2M in mispriced loans before the error was caught.
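A duplicate check before training would have surfaced this. A hedged sketch (function and field names hypothetical) of the kind of pre-training assertion that catches it:

```python
from collections import Counter

def duplicate_report(rows, key_fields):
    """Count rows sharing the same business key. Duplicates silently
    over-weight those records in any model trained on them."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return {k: c for k, c in counts.items() if c > 1}
```

Running this in CI against each training snapshot, and failing the build when the report is non-empty, turns a $2M incident into a red pipeline.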
Data quality isn't just "add more tests." It requires treating data as a product with clear ownership.
Every dataset has an owner. That team defines SLAs, responds to alerts, and manages contracts.
Data catalogs (DataHub, Amundsen) let consumers find data and understand its quality status.
Quality scores, freshness badges, and lineage graphs let consumers judge reliability at a glance.
Consumers report issues. Producers fix root causes. Without this loop, quality degrades silently.
"I treat data quality as a product discipline: every table has an owner, an SLA, and a contract. Quality isn't a gate at the end — it's baked into the development workflow."
Good data quality has three layers. Interviewers want to see you think systemically.
Prevention catches issues before they reach production. Detection catches what slips through. Remediation minimizes blast radius. You need all three.
Most teams over-invest in detection and under-invest in prevention. Data contracts (Module 3) shift quality left.
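What "shifting left" looks like in practice: validate each batch against the contract at write time, so bad rows never land. A minimal, stdlib-only sketch (the contract and its fields are hypothetical, not a specific contract format):

```python
CONTRACT = {  # hypothetical contract for an orders table
    "order_id": int,
    "amount_cents": int,
    "created_at": str,
}

def validate_batch(rows, contract=CONTRACT):
    """Prevention layer: reject a batch before it lands if any row
    has missing columns, unexpected columns, or wrong types."""
    errors = []
    for i, row in enumerate(rows):
        missing = contract.keys() - row.keys()
        extra = row.keys() - contract.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
        if extra:
            errors.append(f"row {i}: unexpected {sorted(extra)}")
        for col, typ in contract.items():
            if col in row and not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")
    return errors
```

The same shape scales up to real contract tooling (JSON Schema, Protobuf, dbt model contracts): a machine-readable schema that producers are held to, enforced before data reaches consumers.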
Know these tools — interviewers expect you to name specific solutions, not just concepts.
"For dbt projects I'd use built-in tests for basics and Elementary for anomaly detection. For cross-platform observability, Monte Carlo or Soda give you freshness, volume, and schema monitoring out of the box."