Freshness = how recently data was updated relative to its expected cadence. It's the #1 metric for data reliability.
"Real-time" = 5-minute SLA. "Daily batch" = 26-hour SLA (buffer for late runs). Define SLAs per table, not globally.
Use the event timestamp (when it happened), not load timestamp (when ETL ran). Load time hides late-arriving data.
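A minimal sketch of an event-timestamp freshness check, mapping lag onto the green/yellow/red statuses used on the dashboard. All names are illustrative, and the yellow band (up to 2× SLA) is an assumed policy, not a standard:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(max_event_ts, sla, now=None):
    """Classify a table's freshness from its newest *event* timestamp.

    Using event time (when it happened) rather than load time (when ETL ran)
    exposes late-arriving data that a load-time check would hide.
    """
    now = now or datetime.now(timezone.utc)
    lag = now - max_event_ts
    if lag <= sla:
        return "green"
    if lag <= 2 * sla:   # assumed policy: warn until twice the SLA
        return "yellow"
    return "red"
```

For a daily-batch table with the 26-hour SLA above, a 20-hour lag is green and a 30-hour lag is yellow.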
A single pane showing freshness status for all critical tables. Green/yellow/red. Stakeholders check before trusting reports.
Alert fatigue kills data quality programs. If your team ignores alerts, you have no monitoring. Here's how to fix it.
Don't hardcode "row count > 1000." Use trailing 7-day average ± 2 standard deviations. Adapts to growth and seasonality.
Monday volume differs from Saturday. Holiday weeks differ from normal. Your baselines should account for day-of-week and known events.
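A sketch of a day-of-week-aware dynamic threshold, assuming volume history is available as (weekday, row_count) pairs; function and variable names are illustrative:

```python
from statistics import mean, stdev

def volume_bounds(history, weekday, k=2.0):
    """Expected row-count range for a given weekday.

    history: list of (weekday, row_count) pairs for trailing weeks.
    The baseline uses only the same weekday, so Monday is compared
    to past Mondays, not to Saturdays.
    """
    same_day = [n for d, n in history if d == weekday]
    mu, sigma = mean(same_day), stdev(same_day)
    return mu - k * sigma, mu + k * sigma
```

Known events (holidays, promotions) can be handled by excluding or tagging those dates in `history` before computing the baseline.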
Alert only when freshness AND volume AND error rate all degrade. Single-signal alerts have high false positive rates.
Suppress alerts during planned maintenance, deployments, and known data source outages. Auto-resume after the window.
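The two rules above can be combined into one paging decision; a minimal sketch, assuming the three degradation signals are computed upstream as booleans:

```python
def should_page(freshness_breached, volume_anomalous, error_rate_high,
                in_maintenance_window=False):
    """Correlate independent signals before paging anyone.

    A single signal (e.g. a Sunday volume dip) is usually noise; all three
    degrading together is usually a real incident. Nothing pages during a
    declared maintenance or deployment window.
    """
    if in_maintenance_window:
        return False
    return freshness_breached and volume_anomalous and error_rate_high
```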
"I use dynamic thresholds with seasonality baselines and multi-signal correlation. If volume drops AND freshness degrades, that's a real incident. Volume alone dropping on Sunday is probably normal."
Anomaly detection catches silent failures — data that looks valid but is wrong.
Anomaly detection is a complement, not a replacement. Use explicit rules for known invariants (amount >= 0) and anomaly detection for unknown unknowns (unexpected metric shifts).
Not all alerts deserve the same response. Tier your alerts and route them appropriately.
P0: Revenue table stale, PK duplicates. Page on-call, block downstream publishes.
P1: Volume anomaly, freshness approaching SLA. Slack channel, investigate within 4 hours.
P2: New column detected, minor NULL rate increase. Log to dashboard, review weekly.
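A sketch of tiered routing that also enforces the owner-and-runbook rule; the route targets and field names are hypothetical placeholders for your pager/Slack/dashboard integrations:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str       # "P0" | "P1" | "P2"
    owner: str
    runbook_url: str

# Hypothetical routing targets, one per tier.
ROUTES = {"P0": "page_oncall", "P1": "slack_channel", "P2": "dashboard_log"}

def route(alert):
    # An alert with no owner or runbook is a bad alert: refuse to register it.
    if not alert.owner or not alert.runbook_url:
        raise ValueError(f"{alert.name}: every alert needs an owner and a runbook")
    return ROUTES[alert.severity]
```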
"Every alert should have an owner and a runbook. If an alert fires and nobody knows what to do, it's a bad alert. I classify by dataset criticality: revenue tables are P0, experimental features are P2."
Know these by heart. Interviewers love asking "what could go wrong?" for any pipeline.
ETL skipped a day. Table looks fine but has a gap. Aggregate row count drops but row-level checks still pass.
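Row-level checks miss this failure, so check the calendar itself; a minimal sketch with illustrative names:

```python
from datetime import date, timedelta

def missing_days(dates_present, start, end):
    """Return calendar days in [start, end] with no rows.

    Catches an ETL run that silently skipped a day: every loaded row
    still looks valid, but a whole day is absent.
    """
    present = set(dates_present)
    d, gaps = start, []
    while d <= end:
        if d not in present:
            gaps.append(d)
        d += timedelta(days=1)
    return gaps
```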
Duplicates from retries, at-least-once delivery, or bad MERGE logic. Revenue inflated, user counts wrong.
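A minimal duplicate-key check; the `order_id`/`amount` field names are illustrative:

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Primary-key values appearing more than once, e.g. after an
    at-least-once replay or a non-idempotent MERGE."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}
```

The downstream damage is easy to demonstrate: summing `amount` over a table with one replayed order inflates revenue by exactly that order's amount.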
Server in UTC, users in PST. Daily aggregates shift by 8 hours. "Monday" includes Sunday night.
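A sketch of bucketing by the user's local day instead of the server's UTC day. A fixed UTC-8 offset is used for brevity; real code should use `zoneinfo.ZoneInfo("America/Los_Angeles")` to handle DST:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # fixed offset for illustration only

def daily_counts(events_utc, tz=PST):
    """Bucket aware UTC timestamps by the *local* calendar day."""
    buckets = defaultdict(int)
    for ts in events_utc:
        buckets[ts.astimezone(tz).date()] += 1
    return dict(buckets)
```

An event at 03:00 UTC Monday lands on Sunday in PST, which is exactly the "Monday includes Sunday night" bug when aggregation stays in UTC.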
VARCHAR(50) silently truncates a 60-char value. No error, just data loss. Test with max-length inputs.
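A sketch of a pre-load guard that fails loudly instead of truncating silently; the function name is illustrative, and real behavior depends on the database's strict mode:

```python
def check_varchar_fit(value, column_max_len):
    """Reject values that a VARCHAR(column_max_len) column would truncate.

    Raising here surfaces the data loss at load time instead of letting
    the engine silently drop the tail of the string.
    """
    if len(value) > column_max_len:
        raise ValueError(
            f"value of length {len(value)} exceeds VARCHAR({column_max_len})"
        )
    return value
```

In tests, feed the pipeline inputs at exactly the limit (50 chars) and just over it (51+) to confirm neither path loses data silently.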
JOIN on non-unique key → row explosion. 1M rows × 3 duplicates = 3M rows. Aggregates suddenly triple.
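A guard against fan-out is to assert key uniqueness on the lookup side before joining; a minimal sketch with illustrative names:

```python
def assert_unique_join_key(rows, key):
    """Fail fast if the join key is not unique on this side.

    A non-unique key makes the JOIN fan out: each left row matches
    multiple right rows, and every downstream aggregate multiplies.
    """
    seen = set()
    for r in rows:
        k = r[key]
        if k in seen:
            raise ValueError(f"join key {key!r} has duplicate value {k!r}")
        seen.add(k)
    return True
```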
"123" auto-cast to INT is fine. "123.45" → INT silently truncates to 123. "N/A" → NULL or error, depending on engine.