🛡️ What Is Data Quality?

Data quality means data is fit for its intended use. Analysts can trust metrics. ML models get clean features. Dashboards tell the truth.

It's both a technical problem (pipelines, checks, monitoring) and a product problem (definitions, contracts, ownership).

Interview Gold

"Data quality is fitness for purpose — data that is correct, complete, consistent, timely, and well-governed so every downstream consumer can make decisions with confidence."

In interviews, emphasize that quality isn't a single tool. It's a defense-in-depth strategy spanning the entire data lifecycle.
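The dimensions in the quote above (correct, complete, consistent, timely) can be made concrete with a tiny scorecard. This is an illustrative pure-Python sketch, not a real tool; the function name, fields, and thresholds are all invented for the example:

```python
from datetime import datetime, timedelta, timezone

def quality_scorecard(rows, required_fields, loaded_at, max_age=timedelta(hours=24)):
    """Score a batch on two classic dimensions from the definition:
    completeness (required fields populated) and timeliness (fresh enough).
    Toy example -- real teams use tools like dbt tests or Great Expectations."""
    total = len(rows) * len(required_fields)
    populated = sum(
        1 for row in rows for field in required_fields
        if row.get(field) not in (None, "")
    )
    return {
        "completeness": populated / total if total else 1.0,
        "timely": datetime.now(timezone.utc) - loaded_at <= max_age,
    }

rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
score = quality_scorecard(rows, ["id", "email"], loaded_at=datetime.now(timezone.utc))
print(score)  # completeness 0.75, timely True
```

The point for interviews is that each dimension is measurable, which is what turns "quality" from a slogan into an SLA.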

The Cost of Bad Data

Interviewers love asking "why does data quality matter?" Answer with concrete business impact.

| Failure | Business Impact |
| --- | --- |
| Duplicate customer records | Over-reporting revenue, sending duplicate emails, compliance violations (GDPR) |
| Stale dashboard data | Executives make decisions on yesterday's numbers — launch a promo that already expired |
| Wrong join keys | Row explosion: 1M rows become 50M, breaking downstream models and reports |
| Timezone bugs | Daily aggregates shift by hours — "Monday" revenue includes Sunday night |
| Silent schema change | Column renamed upstream → NULLs propagate silently for days before anyone notices |

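One of the failures above, wrong join keys causing row explosion, is cheap to guard against with a fan-out check after the join. A minimal pure-Python sketch (the function name and threshold are invented for illustration):

```python
def safe_join(left, right, key, max_fanout=1.5):
    """Join two lists of dicts on `key`, failing fast if the result
    explodes past `max_fanout` times the left side's row count."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = [
        {**l, **r}
        for l in left
        for r in index.get(l[key], [])
    ]
    if len(joined) > max_fanout * len(left):
        raise ValueError(
            f"Row explosion: {len(left)} rows became {len(joined)} after join"
        )
    return joined

orders = [{"order_id": 1, "cust": "a"}, {"order_id": 2, "cust": "b"}]
customers = [{"cust": "a", "tier": "gold"}, {"cust": "b", "tier": "free"}]
print(len(safe_join(orders, customers, "cust")))  # 2: one match per row
```

The same idea applies at warehouse scale: compare pre- and post-join row counts and alert when the ratio crosses a threshold.
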
Real-World Horror Story

A fintech company shipped a pricing model trained on data with duplicate rows. The model over-weighted certain segments. Result: $2M in mispriced loans before the error was caught.

Data Quality as a Product Problem

Data quality isn't just "add more tests." It requires treating data as a product with clear ownership.

Ownership

Every dataset has an owner. That team defines SLAs, responds to alerts, and manages contracts.

Discoverability

Data catalogs (DataHub, Amundsen) let consumers find data and understand its quality status.

Trust Signals

Quality scores, freshness badges, and lineage graphs let consumers judge reliability at a glance.

Feedback Loops

Consumers report issues. Producers fix root causes. Without this loop, quality degrades silently.

Interview Tip

"I treat data quality as a product discipline: every table has an owner, an SLA, and a contract. Quality isn't a gate at the end — it's baked into the development workflow."

The Defense-in-Depth Model

A strong data quality strategy has three layers. Interviewers want to see you think systemically.

1. Prevention: contracts, schemas, CI checks
2. Detection: tests, monitors, anomaly detection
3. Remediation: quarantine, alerts, runbooks

Key Insight

Prevention catches issues before they reach production. Detection catches what slips through. Remediation minimizes blast radius. You need all three.

Most teams over-invest in detection and under-invest in prevention. Data contracts (Module 3) shift quality left.
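The three layers can be sketched in a few lines of Python. This is a toy illustration, not a real framework; the schema, function names, and thresholds are all assumptions:

```python
# Toy three-layer pipeline. Schema, names, and thresholds are illustrative.
EXPECTED_SCHEMA = {"user_id": int, "amount": float}

def prevent(row):
    """Prevention: enforce the contract before a row is ingested."""
    return all(isinstance(row.get(col), typ) for col, typ in EXPECTED_SCHEMA.items())

def run_batch(rows, max_reject_rate=0.2):
    """Detection + remediation: quarantine bad rows to limit blast radius,
    and alert when the reject rate looks anomalous."""
    accepted, quarantined = [], []
    for row in rows:
        (accepted if prevent(row) else quarantined).append(row)
    reject_rate = len(quarantined) / len(rows) if rows else 0.0
    if reject_rate > max_reject_rate:
        raise RuntimeError(f"reject rate {reject_rate:.0%} exceeds threshold")
    return accepted, quarantined
```

Note how the layers compose: the contract check stops bad rows, quarantine keeps the pipeline flowing, and the rate monitor escalates when something systemic breaks upstream.
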

Data Quality in the Modern Stack

Know these tools — interviewers expect you to name specific solutions, not just concepts.

- 🔧 dbt Tests: SQL assertions (not_null, unique, accepted_values, relationships)
- 🎯 Great Expectations: Python-based expectations with profiling and data docs
- 📡 Monte Carlo: data observability — freshness, volume, schema, distribution
- 🔍 Soda Core: YAML-defined checks that run on any SQL warehouse
- 📊 Elementary: dbt-native data observability and anomaly detection
- 📋 Datafold: data diff and regression testing for pipeline changes
Interview Tip

"For dbt projects I'd use built-in tests for basics and Elementary for anomaly detection. For cross-platform observability, Monte Carlo or Soda give you freshness, volume, and schema monitoring out of the box."

Quiz: Test Yourself

Q1: Your interviewer asks "What does data quality mean?" — what's the strongest answer?

Q2: What's the MOST compelling way to explain the cost of bad data to a non-technical stakeholder?

Q3: In the defense-in-depth model, where do data contracts fit?

Q4: Which tool is primarily a data observability platform rather than a testing framework?

Q5: Where do most data teams under-invest in quality?