Freshness Monitoring & SLAs

Freshness = how recently data was updated relative to its expected cadence. It's the most fundamental data reliability metric: if data is stale, nothing downstream can be trusted.

```sql
-- Freshness check: alert if table is >2 hours stale
SELECT DATEDIFF('minute', MAX(updated_at), CURRENT_TIMESTAMP()) AS staleness_min
FROM analytics.orders
HAVING staleness_min > 120
```

Per-Dataset SLAs

"Real-time" = 5-minute SLA. "Daily batch" = 26-hour SLA (buffer for late runs). Define SLAs per table, not globally.

Measure Right Timestamp

Use the event timestamp (when it happened), not the load timestamp (when the ETL ran). Load time can look fresh even when the newest event is hours old, so it hides late-arriving data.

Freshness Dashboard

A single pane showing freshness status for all critical tables. Green/yellow/red. Stakeholders check before trusting reports.

Avoiding Noisy Alerts

Alert fatigue kills data quality programs. If your team ignores alerts, you have no monitoring. Here's how to fix it.

Dynamic Thresholds

Don't hardcode "row count > 1000." Use trailing 7-day average ± 2 standard deviations. Adapts to growth and seasonality.
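A minimal sketch of this idea, assuming you have the trailing 7 days of row counts available (the numbers here are made up for illustration):

```python
import statistics

def volume_bounds(history, k=2.0):
    """Alert bounds from a trailing window of daily row counts:
    mean +/- k standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return mean - k * stdev, mean + k * stdev

def is_volume_anomaly(today, history, k=2.0):
    lower, upper = volume_bounds(history, k)
    return not (lower <= today <= upper)

# Trailing 7 days of row counts; a hardcoded "> 1000" rule would
# either never fire or fire constantly as the table grows.
last_7_days = [1000, 1050, 1100, 1160, 1220, 1280, 1340]
print(is_volume_anomaly(1400, last_7_days))  # continued growth: within bounds
print(is_volume_anomaly(300, last_7_days))   # sharp drop: anomaly
```

Because the bounds are recomputed from the trailing window each day, they drift upward with the table's growth automatically.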

Seasonality-Aware

Monday volume differs from Saturday. Holiday weeks differ from normal. Your baselines should account for day-of-week and known events.
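One way to sketch a day-of-week-aware baseline: key the mean/stdev by weekday so Mondays are only compared to past Mondays. The history below is synthetic, just to show the shape.

```python
import statistics
from collections import defaultdict
from datetime import date, timedelta

def weekday_baselines(history):
    """history: list of (date, row_count) pairs. Returns per-weekday
    (mean, stdev) so each day is compared to its own kind."""
    by_dow = defaultdict(list)
    for day, count in history:
        by_dow[day.weekday()].append(count)
    return {dow: (statistics.mean(v), statistics.stdev(v))
            for dow, v in by_dow.items() if len(v) >= 2}

def is_seasonal_anomaly(day, count, baselines, k=2.0):
    mean, stdev = baselines[day.weekday()]
    return abs(count - mean) > k * stdev

# Hypothetical 4 weeks of history: weekdays ~5000 rows, weekends ~1000.
history = []
for offset in range(28):
    day = date(2024, 1, 1) + timedelta(days=offset)  # Jan 1, 2024 is a Monday
    base = 1000 if day.weekday() >= 5 else 5000
    history.append((day, base + (offset // 7) * 50))  # mild weekly growth

baselines = weekday_baselines(history)
saturday = date(2024, 1, 27)
print(is_seasonal_anomaly(saturday, 1050, baselines))  # False: normal weekend dip
print(is_seasonal_anomaly(saturday, 5000, baselines))  # True: weekday-sized Saturday
```

Known events (holidays, launches) can be layered on top the same way, as an override table of expected deviations.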

Multi-Signal Correlation

Alert only when freshness AND volume AND error rate all degrade. Single-signal alerts have high false positive rates.
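A sketch of the correlation rule. All thresholds here are illustrative, not recommendations; this version pages when at least two of the three signals degrade together, matching the "volume drops AND freshness degrades" case.

```python
def should_alert(freshness_min, volume_drop_pct, error_rate,
                 freshness_sla=120, volume_limit=0.30, error_limit=0.05):
    """Fire only when multiple independent signals degrade together."""
    signals = [
        freshness_min > freshness_sla,   # data is stale
        volume_drop_pct > volume_limit,  # volume fell sharply
        error_rate > error_limit,        # elevated pipeline errors
    ]
    return sum(signals) >= 2  # require corroboration, not a single signal

# Stale AND low volume together -> real incident.
print(should_alert(freshness_min=180, volume_drop_pct=0.6, error_rate=0.0))  # True
# Volume dip alone (e.g. a quiet Sunday) -> no page.
print(should_alert(freshness_min=30, volume_drop_pct=0.6, error_rate=0.0))   # False
```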

Maintenance Windows

Suppress alerts during planned maintenance, deployments, and known data source outages. Auto-resume after the window.
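A minimal suppression sketch, assuming a static schedule of windows in the source's timezone (the Sunday 2-4 AM window is hypothetical):

```python
from datetime import datetime, time

# Hypothetical schedule: (weekday, start, end); Mon=0 .. Sun=6.
MAINTENANCE_WINDOWS = [
    (6, time(2, 0), time(4, 0)),  # Sunday 2-4 AM source maintenance
]

def in_maintenance_window(ts: datetime) -> bool:
    return any(ts.weekday() == dow and start <= ts.time() < end
               for dow, start, end in MAINTENANCE_WINDOWS)

def maybe_send(ts: datetime, send_alert) -> bool:
    """Suppress during the window; alerting auto-resumes afterwards
    because every evaluation re-checks the clock (no manual unmute)."""
    if in_maintenance_window(ts):
        return False  # suppressed
    send_alert()
    return True

print(in_maintenance_window(datetime(2024, 1, 7, 3, 0)))   # Sunday 3 AM -> True
print(in_maintenance_window(datetime(2024, 1, 7, 10, 0)))  # Sunday 10 AM -> False
```

The auto-resume property matters: a window you have to remember to un-mute becomes a permanent blind spot.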

Interview Gold

"I use dynamic thresholds with seasonality baselines and multi-signal correlation. If volume drops AND freshness degrades, that's a real incident. Volume alone dropping on Sunday is probably normal."

Anomaly Detection for Metrics

Anomaly detection catches silent failures — data that looks valid but is wrong.

Good For

  • ✅ Revenue/metric spikes or drops
  • ✅ Distribution shifts in categorical data
  • ✅ Unusual NULL rate increases
  • ✅ Detecting silent pipeline failures

Watch Out For

  • ⚠️ Needs context: is it a bug or a real event?
  • ⚠️ Requires enough history for baselines
  • ⚠️ Can produce false positives during sales/launches
  • ⚠️ Doesn't replace explicit validation rules

Key Insight

Anomaly detection is a complement, not a replacement. Use explicit rules for known invariants (amount >= 0) and anomaly detection for unknown unknowns (unexpected metric shifts).
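The layering can be sketched in a few lines: an explicit rule for the known invariant, plus a z-score check on the daily total for the unknown unknowns. The history and thresholds are illustrative.

```python
import statistics

def check_revenue(amounts, history_totals, k=3.0):
    """Layer explicit rules (known invariants) with an anomaly
    check (unknown unknowns) on the daily total."""
    failures = []
    # Explicit rule: amounts must be non-negative.
    if any(a < 0 for a in amounts):
        failures.append("rule: negative amount")
    # Anomaly check: today's total vs trailing daily totals.
    total = sum(amounts)
    mean = statistics.mean(history_totals)
    stdev = statistics.stdev(history_totals)
    if stdev and abs(total - mean) > k * stdev:
        failures.append("anomaly: daily total out of range")
    return failures

history = [10000, 10200, 9900, 10100, 10050]  # recent daily totals
print(check_revenue([2000, 2000], history))   # every row valid, total is not
print(check_revenue([5000, 5050], history))   # []: nothing to report
```

Note that in the first case every individual row passes `amount >= 0`: only the anomaly layer sees the 60% drop.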

Alert Routing & Severity

Not all alerts deserve the same response. Tier your alerts and route them appropriately.

P0 — Critical

Revenue table stale, PK duplicates. Page on-call, block downstream publishes.

P1 — Warning

Volume anomaly, freshness approaching SLA. Slack channel, investigate within 4 hours.

P2 — Info

New column detected, minor NULL rate increase. Log to dashboard, review weekly.
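The tiering above can be expressed as a small routing table. Dataset names and response targets here are hypothetical; the point is that severity follows dataset criticality, not the individual check.

```python
# Hypothetical tier table mapping severity to route and response expectation.
ROUTES = {
    "P0": {"route": "page_oncall", "block_downstream": True,  "respond": "now"},
    "P1": {"route": "slack",       "block_downstream": False, "respond": "4h"},
    "P2": {"route": "dashboard",   "block_downstream": False, "respond": "weekly"},
}

# Criticality is assigned per dataset.
DATASET_SEVERITY = {
    "analytics.orders": "P0",       # revenue table
    "features.experimental": "P2",  # experimental feature table
}

def route_alert(dataset, default="P1"):
    severity = DATASET_SEVERITY.get(dataset, default)
    return severity, ROUTES[severity]

severity, action = route_alert("analytics.orders")
print(severity, action["route"])  # P0 page_oncall
```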

Alert Fires → Severity Tier → Route (Page / Slack / Log) → Runbook (step-by-step fix)
Interview Tip

"Every alert should have an owner and a runbook. If an alert fires and nobody knows what to do, it's a bad alert. I classify by dataset criticality: revenue tables are P0, experimental features are P2."

Common Failure Modes

Know these by heart. Interviewers love asking "what could go wrong?" for any pipeline.

Missing Partitions

ETL skipped a day. Table looks fine but has a gap. Aggregate row count drops but row-level checks still pass.
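Because row-level checks only see the rows that exist, a calendar diff against the expected partition list is what catches the gap. A minimal sketch:

```python
from datetime import date, timedelta

def missing_partitions(present, start, end):
    """Diff the expected daily calendar against the partitions
    actually loaded; row-level checks can't see the absent day."""
    expected = {start + timedelta(days=i)
                for i in range((end - start).days + 1)}
    return sorted(expected - set(present))

loaded = [date(2024, 3, d) for d in (1, 2, 4, 5)]  # March 3 run was skipped
print(missing_partitions(loaded, date(2024, 3, 1), date(2024, 3, 5)))
# [datetime.date(2024, 3, 3)]
```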

Duplicate Rows

Retries, at-least-once delivery, bad MERGE logic. Revenue inflated, user counts wrong.
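A quick way to surface this is comparing rows seen to distinct primary keys before trusting any SUM() or COUNT(). The `order_id` field below is a hypothetical key:

```python
def duplicate_keys(rows, key="order_id"):
    """At-least-once delivery can land the same event twice;
    report any key that appears more than once."""
    seen, dupes = set(), set()
    for row in rows:
        k = row[key]
        (dupes if k in seen else seen).add(k)
    return dupes

rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 25.0},
    {"order_id": 1, "amount": 10.0},  # retry delivered the same order twice
]
print(duplicate_keys(rows))  # {1}: summing amount now inflates revenue by 10.0
```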

Timezone Bugs

Server in UTC, users in PST. Daily aggregates shift by 8 hours. "Monday" includes Sunday night.

Silent Truncation

VARCHAR(50) silently truncates a 60-char value. No error, just data loss. Test with max-length inputs.

Wrong Joins

JOIN on non-unique key → row explosion. 1M rows × 3 duplicates = 3M rows. Aggregates suddenly triple.
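A cheap pre-join guard is to estimate the fan-out factor from the key distributions before running the join. A sketch with toy keys:

```python
from collections import Counter

def join_fanout(fact_keys, dim_keys):
    """Estimated inner-join output rows per fact row. A factor > 1
    means the supposedly unique dimension key is duplicated."""
    dim_counts = Counter(dim_keys)
    output_rows = sum(dim_counts[k] for k in fact_keys)
    return output_rows / len(fact_keys)

facts = ["a", "b", "c", "a"]       # 4 fact rows
dims = ["a", "a", "a", "b", "c"]   # dimension key "a" appears 3 times
print(join_fanout(facts, dims))    # 2.0: output doubles, aggregates double too
```

A fan-out check like this (or simply asserting the dimension key is unique) is much cheaper than debugging tripled aggregates downstream.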

Type Coercion

"123" auto-cast to INT is fine. "123.45" → INT silently truncates to 123. "N/A" → NULL or error, depending on engine.

Quiz: Test Yourself

Q1: Your pipeline's row count grows 5% weekly. How should you set volume alert thresholds?

Q2: Revenue suddenly drops 60% but all NOT NULL and range checks pass. What catches this?

Q3: Your daily report suddenly shows 3x normal revenue. Pipeline didn't change. What's the likely cause?

Q4: Your data source has planned maintenance every Sunday 2-4 AM. How do you handle alerts?