🛡️ What Is Data Quality?

Data quality means data is fit for its intended use. Analysts can trust metrics. ML models get clean features. Dashboards tell the truth.

It's both a technical problem (pipelines, checks, monitoring) and a product problem (definitions, contracts, ownership).

Interview Gold

"Data quality is fitness for purpose — data that is correct, complete, consistent, timely, and well-governed so every downstream consumer can make decisions with confidence."

In interviews, emphasize that quality isn't a single tool. It's a defense-in-depth strategy spanning the entire data lifecycle.
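The dimensions in the quote above (correct, complete, consistent, timely) can be made concrete with a tiny scorecard. This is an illustrative pure-Python sketch, not a real tool; the function name, fields, and thresholds are all invented for the example:

```python
from datetime import datetime, timedelta, timezone

def quality_scorecard(rows, required_fields, loaded_at, max_age=timedelta(hours=24)):
    """Score a batch on two classic dimensions from the definition:
    completeness (required fields populated) and timeliness (fresh enough).
    Toy example -- real teams use tools like dbt tests or Great Expectations."""
    total = len(rows) * len(required_fields)
    populated = sum(
        1 for row in rows for field in required_fields
        if row.get(field) not in (None, "")
    )
    return {
        "completeness": populated / total if total else 1.0,
        "timely": datetime.now(timezone.utc) - loaded_at <= max_age,
    }

rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
score = quality_scorecard(rows, ["id", "email"], loaded_at=datetime.now(timezone.utc))
print(score)  # completeness 0.75, timely True
```

The point for interviews is that each dimension is measurable, which is what turns "quality" from a slogan into an SLA.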

The Cost of Bad Data

Interviewers love asking "why does data quality matter?" Answer with concrete business impact.

| Failure | Business Impact |
| --- | --- |
| Duplicate customer records | Over-reporting revenue, sending duplicate emails, compliance violations (GDPR) |
| Stale dashboard data | Executives make decisions on yesterday's numbers — launch a promo that already expired |
| Wrong join keys | Row explosion: 1M rows become 50M, breaking downstream models and reports |
| Timezone bugs | Daily aggregates shift by hours — "Monday" revenue includes Sunday night |
| Silent schema change | Column renamed upstream → NULLs propagate silently for days before anyone notices |

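One of the failures above, wrong join keys causing row explosion, is cheap to guard against with a fan-out check after the join. A minimal pure-Python sketch (the function name and threshold are invented for illustration):

```python
def safe_join(left, right, key, max_fanout=1.5):
    """Join two lists of dicts on `key`, failing fast if the result
    explodes past `max_fanout` times the left side's row count."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = [
        {**l, **r}
        for l in left
        for r in index.get(l[key], [])
    ]
    if len(joined) > max_fanout * len(left):
        raise ValueError(
            f"Row explosion: {len(left)} rows became {len(joined)} after join"
        )
    return joined

orders = [{"order_id": 1, "cust": "a"}, {"order_id": 2, "cust": "b"}]
customers = [{"cust": "a", "tier": "gold"}, {"cust": "b", "tier": "free"}]
print(len(safe_join(orders, customers, "cust")))  # 2: one match per row
```

The same idea applies at warehouse scale: compare pre- and post-join row counts and alert when the ratio crosses a threshold.
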
Real-World Horror Story

A fintech company shipped a pricing model trained on data with duplicate rows. The model over-weighted certain segments. Result: $2M in mispriced loans before the error was caught.

Data Quality as a Product Problem

Data quality isn't just "add more tests." It requires treating data as a product with clear ownership.

Ownership

Every dataset has an owner. That team defines SLAs, responds to alerts, and manages contracts.

Discoverability

Data catalogs (DataHub, Amundsen) let consumers find data and understand its quality status.

Trust Signals

Quality scores, freshness badges, and lineage graphs let consumers judge reliability at a glance.

Feedback Loops

Consumers report issues. Producers fix root causes. Without this loop, quality degrades silently.

Interview Tip

"I treat data quality as a product discipline: every table has an owner, an SLA, and a contract. Quality isn't a gate at the end — it's baked into the development workflow."

The Defense-in-Depth Model

A strong data quality strategy has three layers. Interviewers want to see you think systemically.

1. Prevention: contracts, schemas, CI checks
2. Detection: tests, monitors, anomaly detection
3. Remediation: quarantine, alerts, runbooks

Key Insight

Prevention catches issues before they reach production. Detection catches what slips through. Remediation minimizes blast radius. You need all three.

Most teams over-invest in detection and under-invest in prevention. Data contracts (Module 3) shift quality left.
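The three layers can be sketched in a few lines of Python. This is a toy illustration, not a real framework; the schema, function names, and thresholds are all assumptions:

```python
# Toy three-layer pipeline. Schema, names, and thresholds are illustrative.
EXPECTED_SCHEMA = {"user_id": int, "amount": float}

def prevent(row):
    """Prevention: enforce the contract before a row is ingested."""
    return all(isinstance(row.get(col), typ) for col, typ in EXPECTED_SCHEMA.items())

def run_batch(rows, max_reject_rate=0.2):
    """Detection + remediation: quarantine bad rows to limit blast radius,
    and alert when the reject rate looks anomalous."""
    accepted, quarantined = [], []
    for row in rows:
        (accepted if prevent(row) else quarantined).append(row)
    reject_rate = len(quarantined) / len(rows) if rows else 0.0
    if reject_rate > max_reject_rate:
        raise RuntimeError(f"reject rate {reject_rate:.0%} exceeds threshold")
    return accepted, quarantined
```

Note how the layers compose: the contract check stops bad rows, quarantine keeps the pipeline flowing, and the rate monitor escalates when something systemic breaks upstream.
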

Data Quality in the Modern Stack

Know these tools — interviewers expect you to name specific solutions, not just concepts.

- 🔧 dbt Tests: SQL assertions (not_null, unique, accepted_values, relationships)
- 🎯 Great Expectations: Python-based expectations with profiling and data docs
- 📡 Monte Carlo: data observability — freshness, volume, schema, distribution
- 🔍 Soda Core: YAML-defined checks that run on any SQL warehouse
- 📊 Elementary: dbt-native data observability and anomaly detection
- 📋 Datafold: data diff and regression testing for pipeline changes
Interview Tip

"For dbt projects I'd use built-in tests for basics and Elementary for anomaly detection. For cross-platform observability, Monte Carlo or Soda give you freshness, volume, and schema monitoring out of the box."

Quiz: Test Yourself

Q1: Your interviewer asks "What does data quality mean?" — what's the strongest answer?

Q2: What's the MOST compelling way to explain the cost of bad data to a non-technical stakeholder?

Q3: In the defense-in-depth model, where do data contracts fit?

Q4: Which tool is primarily a data observability platform rather than a testing framework?

Q5: Where do most data teams under-invest in quality?