The Interview Gauntlet 🔥

Tricky scenarios, debugging exercises, and rapid-fire questions. Click "Reveal Answer" to check your thinking.

Scenario 1: The Silent Dashboard

A stakeholder says "the revenue dashboard hasn't updated since yesterday, but nobody got an alert." What happened and how do you prevent it?

Likely: the freshness check has a threshold that's too loose (e.g., 48h for a daily table), OR the alert is routing to a channel nobody monitors. Fix: Set freshness SLA to 26h for daily tables (buffer for late runs), route to PagerDuty for critical dashboards, and add a meta-alert that fires if no checks run at all ("dead man's switch").

Scenario 2: Row Count Doubled Overnight

Your orders table suddenly has 2x the expected rows. All row-level checks pass. What's your investigation plan?

Step 1: Check for duplicate primary keys (SELECT pk, COUNT(*) FROM orders GROUP BY pk HAVING COUNT(*) > 1). Step 2: Check whether a retry/reprocessing happened (look at _loaded_at timestamps). Step 3: Check whether an upstream JOIN changed from 1:1 to 1:N due to a non-unique key. The root cause is almost always duplicates from retries or bad joins.
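Step 1 above can be sketched end to end with an in-memory SQLite table; the table and column names are illustrative.

```python
import sqlite3

# Duplicate-primary-key check from step 1, run against a tiny in-memory
# table. In this fixture, order A1 was reprocessed and loaded twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("A1", 10.0), ("A2", 5.0), ("A1", 10.0)])

dupes = conn.execute("""
    SELECT order_id, COUNT(*) AS n
    FROM orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('A1', 2)]
```

Any non-empty result here means the "2x rows" came from repeated keys, and the _loaded_at timestamps will tell you which load introduced them.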

Scenario 3: "My Numbers Don't Match"

A data analyst says the warehouse total revenue differs from the billing system by $50K. How do you debug this?

This is a reconciliation problem. Step 1: Compare row counts (source vs warehouse). Step 2: Compare SUM(amount) by date to find which day diverges. Step 3: Check for duplicates, missing rows, or currency/timezone differences. Step 4: Check if refunds or adjustments are included in one but not the other. Build an automated reconciliation check to prevent recurrence.
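Step 2 of the reconciliation can be sketched as a per-day comparison; the inputs (date to total revenue) and the tolerance are illustrative.

```python
# Sum revenue by day on both sides and report every date where the
# totals diverge beyond a small tolerance.
source = {"2024-01-01": 120_000.0, "2024-01-02": 95_000.0, "2024-01-03": 110_000.0}
warehouse = {"2024-01-01": 120_000.0, "2024-01-02": 45_000.0, "2024-01-03": 110_000.0}

def diverging_days(src, dst, tolerance=0.01):
    days = sorted(set(src) | set(dst))
    return [(d, src.get(d, 0.0) - dst.get(d, 0.0))
            for d in days
            if abs(src.get(d, 0.0) - dst.get(d, 0.0)) > tolerance]

print(diverging_days(source, warehouse))  # [('2024-01-02', 50000.0)]
```

Once the divergence is pinned to one day, the duplicate/missing-row/currency checks in steps 3 and 4 only need to run over that day's partition.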

Debugging Exercises

Debug 1: The Monday Spike

Every Monday, your daily active users metric jumps 30% then drops Tuesday. The pipeline hasn't changed. What's causing this?

Timezone bug. If your events are in UTC but your daily partitioning uses local time, Sunday evening UTC events (which are Monday local time in some zones) pile into Monday's partition. Fix: ensure consistent timezone handling — either all UTC or explicit timezone conversion before partitioning.
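The bug is easy to see with one concrete event: a Sunday-evening event in New York lands in different daily partitions depending on whether you partition by UTC date or local date.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# 22:30 Sunday in New York is already 03:30 Monday in UTC, so the same
# event falls into different daily partitions under the two conventions.
event = datetime(2024, 1, 7, 22, 30, tzinfo=ZoneInfo("America/New_York"))

utc_partition = event.astimezone(ZoneInfo("UTC")).date().isoformat()
local_partition = event.date().isoformat()

print(utc_partition)    # 2024-01-08 (Monday)
print(local_partition)  # 2024-01-07 (Sunday)
```

Whichever convention you pick, the fix is to apply it in exactly one place, before partitioning, so every model downstream agrees on which day an event belongs to.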

Debug 2: The Disappearing Rows

Your staging table has 1M rows, but after your dbt model runs, the mart has only 800K. No errors in logs. What happened?

Most likely: an INNER JOIN dropped 200K rows that don't have matching keys in the joined table. Alternatives: a WHERE clause is too aggressive, or a DISTINCT is collapsing rows with duplicate keys. Debug: Run the query without each JOIN/filter incrementally to find where rows drop. Add a reconciliation check: source count vs output count.
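The JOIN hypothesis can be tested directly with an anti-join that surfaces the staging rows an INNER JOIN would silently drop; the table and column names here are illustrative.

```python
import sqlite3

# Anti-join: find staging rows with no matching customer. These are
# exactly the rows an INNER JOIN on customer_id would discard.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customers (customer_id INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(1, 10), (2, 11), (3, 99)])
conn.executemany("INSERT INTO customers VALUES (?)", [(10,), (11,)])

dropped = conn.execute("""
    SELECT s.id FROM staging s
    LEFT JOIN customers c USING (customer_id)
    WHERE c.customer_id IS NULL
""").fetchall()
print(dropped)  # [(3,)] -- rows that vanish under an INNER JOIN
```

If the dropped count matches the 200K gap, you have your answer; if not, repeat the same probe for each WHERE clause and DISTINCT in the model.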

Debug 3: NULLs Appearing After Deploy

After deploying a new version of your API, a column that was never NULL suddenly has 15% NULLs. No schema change detected. Why?

The API changed its behavior — a field that used to be required is now optional in the new version. Schema didn't change (column still exists, same type) but the data contract was implicitly broken. Fix: Add NOT NULL checks in your data contract, and require API teams to update the data contract when changing field requirements.

Debug 4: The Phantom Revenue

Revenue increased 200% in the warehouse but the source system shows no change. All pipeline checks pass. What's wrong?

Classic JOIN explosion. Somewhere in the pipeline, a JOIN key is non-unique, causing rows to multiply. Each duplicate order × 3 matching records = 3x revenue. Debug: Check DISTINCT counts of join keys on both sides. Add a uniqueness test on join keys before every JOIN. This is why aggregate reconciliation (SUM(amount) source vs target) is critical.
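The "check DISTINCT counts of join keys on both sides" step can be sketched as a tiny pre-join guard; the key values are illustrative.

```python
# Compare total vs. distinct join-key counts on each side of a join.
# Any side where the key repeats can fan out the other side's rows.
def fanout_risk(keys):
    """Return (rows, distinct, is_unique) for a list of join-key values."""
    return len(keys), len(set(keys)), len(keys) == len(set(keys))

left_keys = ["o1", "o2", "o3"]
right_keys = ["o1", "o1", "o1", "o2"]  # o1 matches 3x -> 3x revenue for o1

print(fanout_risk(left_keys))   # (3, 3, True)
print(fanout_risk(right_keys))  # (4, 2, False) -- non-unique: join will multiply
```

Running this as a blocking test before every JOIN turns a silent 3x revenue explosion into an explicit pipeline failure.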

Scenario Questions

Scenario 5: Schema Drift in Production

An upstream team renames a column in their API response without telling you. Your pipeline doesn't fail — it just fills the old column with NULLs. How do you prevent this?

Implement a data contract with schema validation at ingestion. Check that all expected columns exist AND that their NULL rates stay within expected bounds. A sudden spike in NULL rate for a previously non-null column should trigger an alert. Also: enforce contracts in the producer's CI — they can't deploy without passing schema compatibility checks.

Scenario 6: Alert Fatigue

Your team has 200 data quality alerts. 80% are false positives. Engineers ignore them all. How do you fix this?

Step 1: Audit all alerts — delete or mute ones with >50% false positive rate. Step 2: Switch from static to dynamic thresholds. Step 3: Tier alerts by severity (P0/P1/P2). Step 4: Only page for P0 (critical business impact). Step 5: Add runbooks to every alert. Step 6: Track alert-to-action ratio as a metric. Goal: fewer alerts, each one actionable.

Scenario 7: Late-Arriving Data Mess

Your mobile app batches events and sends them when the user regains internet. Some events are 3 days late. How do you handle this without re-running 3 days of pipeline history?

Use a "late-arriving data" handler: ingest by ingestion_time (append-only), but partition downstream models by event_time. Set a completeness window (e.g., 7 days) — partitions older than 7 days are "closed" and not re-computed. For closed partitions, route late events to a correction table for manual or periodic reconciliation.

Scenario 8: Quarantine Overflow

Your quarantine table grows to 10% of total data. Stakeholders ask "is this normal?" What do you do?

10% quarantine rate is a red flag. Step 1: Analyze quarantine reasons — are they concentrated in one rule or spread evenly? Step 2: If one rule dominates, it might be too strict (e.g., rejecting valid edge cases). Step 3: Work with the producer to fix the root cause. Step 4: Set a quarantine rate alert (>5% = warning, >10% = critical). Quarantine is for exceptions, not for hiding systemic problems.
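The tiered alert (step 4) plus the reason breakdown (step 1) can be sketched together; the thresholds and reason strings are illustrative.

```python
from collections import Counter

# Quarantine-rate alert with severity tiers, plus the most common
# rejection reason so you can see whether one rule dominates.
def quarantine_status(total_rows, quarantined_reasons):
    rate = len(quarantined_reasons) / total_rows
    if rate > 0.10:
        level = "critical"
    elif rate > 0.05:
        level = "warning"
    else:
        level = "ok"
    return level, rate, Counter(quarantined_reasons).most_common(1)

reasons = ["negative_amount"] * 90 + ["bad_currency"] * 10
print(quarantine_status(1000, reasons))
# ('warning', 0.1, [('negative_amount', 90)])
```

Here 90% of rejections come from one rule, which is the signal to audit that rule (too strict?) before escalating to the producer.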

Rapid-Fire Q&A

Click each question to reveal the answer. Aim for 30 seconds per answer.

What's the difference between data observability and data testing?

Testing is explicit (you write rules). Observability is automatic (monitors metrics like freshness, volume, distribution without writing specific tests). You need both — tests for known invariants, observability for unknown unknowns.

How would you implement data quality for a Kafka streaming pipeline?

Schema validation at the producer (Avro/Protobuf with Schema Registry), consumer-side checks on each message batch, dead-letter queue for invalid messages, real-time anomaly detection on throughput and error rates. Don't block the stream — quarantine bad messages.
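The consumer-side check with a dead-letter queue can be sketched library-agnostically; the message shape and validation rules are illustrative, and a real consumer would plug this into its poll loop.

```python
# Validate each message in a batch; route failures to a dead-letter
# queue instead of blocking the stream.
def process_batch(messages, emit, dead_letter):
    for msg in messages:
        errors = []
        if not isinstance(msg.get("user_id"), int):
            errors.append("user_id must be int")
        if msg.get("amount", 0) < 0:
            errors.append("amount must be >= 0")
        if errors:
            dead_letter.append({"message": msg, "errors": errors})
        else:
            emit.append(msg)

good, dlq = [], []
process_batch([{"user_id": 1, "amount": 9.5},
               {"user_id": "oops", "amount": -1}], good, dlq)
print(len(good), len(dlq))  # 1 1
```

The key property is that one malformed message delays nothing: it is captured with its error list for later replay while valid traffic flows on.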

What's a "dead man's switch" for data quality?

An alert that fires when your monitoring system itself fails to run. If your daily quality checks didn't execute at all (not "passed" or "failed" — just didn't run), the dead man's switch catches it. Without this, a broken scheduler silently disables all your quality checks.
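A minimal sketch: each check run writes a heartbeat timestamp, and a separately scheduled job alerts when the heartbeat goes stale. The 25-hour window (daily schedule plus buffer) is an illustrative choice.

```python
from datetime import datetime, timedelta, timezone

# Dead man's switch: fires when the quality checks themselves stopped
# running, regardless of whether past runs passed or failed.
HEARTBEAT_WINDOW = timedelta(hours=25)

def dead_mans_switch(last_heartbeat: datetime, now: datetime) -> bool:
    """True means: the checks did not run at all -- page someone."""
    return now - last_heartbeat > HEARTBEAT_WINDOW

now = datetime(2024, 1, 4, tzinfo=timezone.utc)
print(dead_mans_switch(datetime(2024, 1, 1, tzinfo=timezone.utc), now))      # True
print(dead_mans_switch(datetime(2024, 1, 3, 12, tzinfo=timezone.utc), now))  # False
```

Crucially, this job must run on different infrastructure than the checks it watches; otherwise the same scheduler outage silences both.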

How do you handle data quality for ML feature stores?

Feature-level validation (ranges, distributions, NULL rates), training-serving skew detection (compare feature distributions in training vs serving), freshness guarantees per feature, and point-in-time correctness checks to prevent data leakage.
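Training-serving skew detection is usually done with PSI or a KS test; the sketch below swaps in a simpler mean-drift ratio to keep the idea visible, with an illustrative threshold.

```python
import statistics

# Compare a feature's serving distribution against its training baseline:
# flag when the serving mean drifts more than `threshold` training
# standard deviations away.
def mean_drift(training, serving, threshold=3.0):
    mu, sigma = statistics.mean(training), statistics.stdev(training)
    drift = abs(statistics.mean(serving) - mu) / sigma
    return drift > threshold, drift

train = [10.0, 11.0, 9.0, 10.5, 9.5]
serve_ok = [10.2, 9.8, 10.1]
serve_skewed = [25.0, 26.0, 24.0]

print(mean_drift(train, serve_ok)[0])      # False
print(mean_drift(train, serve_skewed)[0])  # True
```

The same comparison run per feature, per serving window, is the core of a training-serving skew monitor.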

What's the cost of over-engineering data quality?

Too many checks slow pipeline runs, create maintenance burden, and produce alert fatigue. Start with the "starter five" (freshness, volume, PK uniqueness, NOT NULL, referential integrity), then add domain-specific checks only when a real issue occurs. Quality is a spectrum, not a binary.

How do you prioritize which tables to add quality checks to first?

By business impact. Start with tables that feed executive dashboards, compliance reports, or customer-facing features. A bug in an internal experiment table is annoying; a bug in the billing table is catastrophic. Use a criticality tier system.

What happens when you can't fix bad data at the source?

Document it. Apply defensive transformations (COALESCE, CASE WHEN for known bad values), quarantine what you can't fix, and add prominent warnings in the data catalog. Don't silently propagate known-bad data — make it visible.
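The defensive transformations can be sketched in SQL via SQLite; the table name, the -1 amount, and the 'XX' country sentinel are illustrative known-bad values.

```python
import sqlite3

# COALESCE for NULLs and CASE WHEN for a known-bad sentinel value,
# applied in the transformation layer since the source can't be fixed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (amount REAL, country TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(19.99, "US"), (None, "DE"), (-1.0, "XX")])

cleaned = conn.execute("""
    SELECT
        COALESCE(amount, 0.0) AS amount,
        CASE WHEN country = 'XX' THEN NULL ELSE country END AS country
    FROM raw_orders
""").fetchall()
print(cleaned)  # [(19.99, 'US'), (0.0, 'DE'), (-1.0, None)]
```

Pair every such transformation with a catalog note explaining the defaulting rule, so consumers know a 0.0 amount may mean "source was NULL" rather than "free order".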

Final Quiz: Prove Your Mastery

Q1: Your entire data quality monitoring system goes down for 3 days. What alert should have caught this?

Q2: After adding a new JOIN to your pipeline, total revenue triples. Source data hasn't changed. Root cause?

Q3: You have 200 tables and no quality checks. Where do you start?

Q4: Your pipeline must be retry-safe. Which write pattern ensures idempotency?

Q5: A team says "we have data contracts" but only has a JSON schema file. What are they missing?

Common Pitfalls & Gotchas

Pitfall 1: Testing only happy paths

Your golden datasets must include NULLs, empty strings, duplicates, max-length values, and timezone edges. Production data is messy — test for mess.

Pitfall 2: Checking row-level but not aggregate

A table with zero rows passes all NOT NULL checks. Always pair row-level rules with volume and freshness checks.
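The vacuous-pass trap is two lines of code; the minimum-row threshold is illustrative.

```python
# An empty table trivially passes a row-level NOT NULL check, so pair
# every row-level rule with a minimum-volume check.
def not_null_check(rows, column):
    return all(r[column] is not None for r in rows)

def volume_check(rows, min_rows=1):
    return len(rows) >= min_rows

empty = []
print(not_null_check(empty, "order_id"))  # True -- vacuously passes!
print(volume_check(empty))                # False -- volume check catches it
```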

Pitfall 3: Alerts without runbooks

An alert that fires at 3 AM with no instructions on what to do is worse than no alert. Every alert needs an owner and a step-by-step runbook.

Pitfall 4: SELECT * in production

SELECT * breaks when columns are added or reordered. Always enumerate columns explicitly in production queries and use contract-enforced schemas.

Pitfall 5: Trusting the source blindly

Even trusted APIs change behavior. A field that was "always populated" can become optional in a new API version. Validate at ingestion, not just at transformation.

Pitfall 6: Ignoring quarantine

If nobody reviews quarantined data, you're just hiding problems. Set a weekly quarantine review cadence and track quarantine rate as a metric.

You've completed the course! Go ace that interview. 🎉