Every governance program fails without clear accountability. "Nobody owns it" is the #1 governance failure mode.
Three core roles appear in every interview. Know who does what.
The owner says "what", the steward says "how well", and the custodian says "how technically." If asked who approves a new dataset, it's the owner. If asked who fixes a broken schema, it's the custodian.
These are two different layers of "data about data." Interviewers test whether you understand the distinction.
"A business glossary aligns meaning. Technical metadata enables implementation. You need both: the glossary says what 'net revenue' means, and the technical metadata tells you which table, column, and pipeline produces it."
A data catalog is a searchable inventory of all datasets in your organization, enriched with metadata.
Think of it as "Google for your data." Without one, analysts spend 30% of their time just finding the right table.
| Metadata Type | Example | Why It Matters |
|---|---|---|
| 📝 Description | "Daily aggregated order metrics" | Know what data represents |
| 👤 Owner | "Commerce team — Jane Doe" | Know who to ask questions |
| 🔗 Lineage | "orders_raw → orders_clean → orders_daily" | Trace data flow |
| 🏷️ Classification | "Contains PII: email, phone" | Know sensitivity level |
| ⏰ Freshness | "Updated daily at 03:00 UTC" | Know if data is current |
| 📊 Schema | "12 columns, partitioned by date" | Understand structure |
| 📈 Popularity | "Queried 340 times/week" | Find trusted datasets |
Name real tools: DataHub (open-source, LinkedIn), Atlan (cloud-native), Alation (enterprise), OpenMetadata (open-source). Mentioning specific tools shows hands-on experience.
Lineage tracks how data flows from source to consumption. It answers: "Where did this number come from?"
"Revenue dropped 20%" → trace lineage back to find the broken upstream pipeline.
"If I change this column, what breaks?" Lineage shows all downstream dependents before you ship.
"Where does user email appear?" Lineage reveals every table that touches PII for GDPR requests.
"Column-level lineage is the gold standard. Table-level tells you which tables are connected. Column-level tells you exactly which fields flow where — critical for PII tracking and impact analysis."
The goal of governance isn't to lock data down. It's to make the right data easy to find and safe to use.
Mark datasets as "certified" = reviewed, accurate, and maintained. Analysts prioritize certified sources.
Full-text search across table names, column names, descriptions, and tags. "Find tables with customer email."
Show query frequency, who uses it, and which dashboards depend on it. Popular = probably trusted.
ML-based scanners detect PII patterns (emails, SSNs, phone numbers) and auto-tag columns.
"Self-service doesn't mean 'no rules.' It means guardrails that guide users to the right data with the right access. Think of it as a highway with lanes, not a locked gate."