Freshness = how recently data was updated relative to its expected cadence. It's the #1 metric for data reliability.
"Real-time" = 5-minute SLA. "Daily batch" = 26-hour SLA (buffer for late runs). Define SLAs per table, not globally.
Use the event timestamp (when it happened), not load timestamp (when ETL ran). Load time hides late-arriving data.
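A minimal sketch of an event-timestamp freshness check, mapping lag onto the green/yellow/red statuses used on the dashboard. All names are illustrative, and the yellow band (up to 2× SLA) is an assumed policy, not a standard:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(max_event_ts, sla, now=None):
    """Classify a table's freshness from its newest *event* timestamp.

    Using event time (when it happened) rather than load time (when ETL ran)
    exposes late-arriving data that a load-time check would hide.
    """
    now = now or datetime.now(timezone.utc)
    lag = now - max_event_ts
    if lag <= sla:
        return "green"
    if lag <= 2 * sla:   # assumed policy: warn until twice the SLA
        return "yellow"
    return "red"
```

For a daily-batch table with the 26-hour SLA above, a 20-hour lag is green and a 30-hour lag is yellow.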
A single pane showing freshness status for all critical tables. Green/yellow/red. Stakeholders check before trusting reports.
Alert fatigue kills data quality programs. If your team ignores alerts, you have no monitoring. Here's how to fix it.
Don't hardcode "row count > 1000." Use trailing 7-day average ± 2 standard deviations. Adapts to growth and seasonality.
Monday volume differs from Saturday. Holiday weeks differ from normal. Your baselines should account for day-of-week and known events.
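A sketch of a day-of-week-aware dynamic threshold, assuming volume history is available as (weekday, row_count) pairs; function and variable names are illustrative:

```python
from statistics import mean, stdev

def volume_bounds(history, weekday, k=2.0):
    """Expected row-count range for a given weekday.

    history: list of (weekday, row_count) pairs for trailing weeks.
    The baseline uses only the same weekday, so Monday is compared
    to past Mondays, not to Saturdays.
    """
    same_day = [n for d, n in history if d == weekday]
    mu, sigma = mean(same_day), stdev(same_day)
    return mu - k * sigma, mu + k * sigma
```

Known events (holidays, promotions) can be handled by excluding or tagging those dates in `history` before computing the baseline.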
Alert only when freshness AND volume AND error rate all degrade. Single-signal alerts have high false positive rates.
Suppress alerts during planned maintenance, deployments, and known data source outages. Auto-resume after the window.
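The two rules above can be combined into one paging decision; a minimal sketch, assuming the three degradation signals are computed upstream as booleans:

```python
def should_page(freshness_breached, volume_anomalous, error_rate_high,
                in_maintenance_window=False):
    """Correlate independent signals before paging anyone.

    A single signal (e.g. a Sunday volume dip) is usually noise; all three
    degrading together is usually a real incident. Nothing pages during a
    declared maintenance or deployment window.
    """
    if in_maintenance_window:
        return False
    return freshness_breached and volume_anomalous and error_rate_high
```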
"I use dynamic thresholds with seasonality baselines and multi-signal correlation. If volume drops AND freshness degrades, that's a real incident. Volume alone dropping on Sunday is probably normal."
Anomaly detection catches silent failures — data that looks valid but is wrong.
Anomaly detection is a complement, not a replacement. Use explicit rules for known invariants (amount >= 0) and anomaly detection for unknown unknowns (unexpected metric shifts).
Not all alerts deserve the same response. Tier your alerts and route them appropriately.
P0: Revenue table stale, PK duplicates. Page on-call, block downstream publishes.
P1: Volume anomaly, freshness approaching SLA. Slack channel, investigate within 4 hours.
P2: New column detected, minor NULL rate increase. Log to dashboard, review weekly.
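A sketch of tiered routing that also enforces the owner-and-runbook rule; the route targets and field names are hypothetical placeholders for your pager/Slack/dashboard integrations:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str       # "P0" | "P1" | "P2"
    owner: str
    runbook_url: str

# Hypothetical routing targets, one per tier.
ROUTES = {"P0": "page_oncall", "P1": "slack_channel", "P2": "dashboard_log"}

def route(alert):
    # An alert with no owner or runbook is a bad alert: refuse to register it.
    if not alert.owner or not alert.runbook_url:
        raise ValueError(f"{alert.name}: every alert needs an owner and a runbook")
    return ROUTES[alert.severity]
```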
"Every alert should have an owner and a runbook. If an alert fires and nobody knows what to do, it's a bad alert. I classify by dataset criticality: revenue tables are P0, experimental features are P2."
Know these by heart. Interviewers love asking "what could go wrong?" for any pipeline.
ETL skipped a day. Table looks fine but has a gap. Aggregate row count drops but row-level checks still pass.
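Row-level checks miss this failure, so check the calendar itself; a minimal sketch with illustrative names:

```python
from datetime import date, timedelta

def missing_days(dates_present, start, end):
    """Return calendar days in [start, end] with no rows.

    Catches an ETL run that silently skipped a day: every loaded row
    still looks valid, but a whole day is absent.
    """
    present = set(dates_present)
    d, gaps = start, []
    while d <= end:
        if d not in present:
            gaps.append(d)
        d += timedelta(days=1)
    return gaps
```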
Duplicates from retries, at-least-once delivery, or bad MERGE logic. Revenue inflated, user counts wrong.
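A minimal duplicate-key check; the `order_id`/`amount` field names are illustrative:

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Primary-key values appearing more than once, e.g. after an
    at-least-once replay or a non-idempotent MERGE."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}
```

The downstream damage is easy to demonstrate: summing `amount` over a table with one replayed order inflates revenue by exactly that order's amount.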
Server in UTC, users in PST. Daily aggregates shift by 8 hours. "Monday" includes Sunday night.
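A sketch of bucketing by the user's local day instead of the server's UTC day. A fixed UTC-8 offset is used for brevity; real code should use `zoneinfo.ZoneInfo("America/Los_Angeles")` to handle DST:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # fixed offset for illustration only

def daily_counts(events_utc, tz=PST):
    """Bucket aware UTC timestamps by the *local* calendar day."""
    buckets = defaultdict(int)
    for ts in events_utc:
        buckets[ts.astimezone(tz).date()] += 1
    return dict(buckets)
```

An event at 03:00 UTC Monday lands on Sunday in PST, which is exactly the "Monday includes Sunday night" bug when aggregation stays in UTC.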
VARCHAR(50) silently truncates a 60-char value. No error, just data loss. Test with max-length inputs.
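A sketch of a pre-load guard that fails loudly instead of truncating silently; the function name is illustrative, and real behavior depends on the database's strict mode:

```python
def check_varchar_fit(value, column_max_len):
    """Reject values that a VARCHAR(column_max_len) column would truncate.

    Raising here surfaces the data loss at load time instead of letting
    the engine silently drop the tail of the string.
    """
    if len(value) > column_max_len:
        raise ValueError(
            f"value of length {len(value)} exceeds VARCHAR({column_max_len})"
        )
    return value
```

In tests, feed the pipeline inputs at exactly the limit (50 chars) and just over it (51+) to confirm neither path loses data silently.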
JOIN on non-unique key → row explosion. 1M rows × 3 duplicates = 3M rows. Aggregates suddenly triple.
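A guard against fan-out is to assert key uniqueness on the lookup side before joining; a minimal sketch with illustrative names:

```python
def assert_unique_join_key(rows, key):
    """Fail fast if the join key is not unique on this side.

    A non-unique key makes the JOIN fan out: each left row matches
    multiple right rows, and every downstream aggregate multiplies.
    """
    seen = set()
    for r in rows:
        k = r[key]
        if k in seen:
            raise ValueError(f"join key {key!r} has duplicate value {k!r}")
        seen.add(k)
    return True
```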
"123" auto-cast to INT is fine. "123.45" → INT silently truncates to 123. "N/A" → NULL or error, depending on engine.