👥 Roles in Data Governance

Every governance program fails without clear accountability. "Nobody owns it" is the #1 governance failure mode.

Three core roles come up again and again in interviews. Know who does what.

👑 Data Owner

Typically: VP/Director of the business domain

🔧 Data Steward

Typically: Senior analyst or data lead in the domain

⚙️ Data Custodian

Typically: Data engineer or platform engineer

🎯 Interview Tip

The owner says "what", the steward says "how well", and the custodian says "how technically." If asked who approves a new dataset, it's the owner. If asked who fixes a broken schema, it's the custodian.

📖 Business Glossary vs Technical Metadata

These are two different layers of "data about data." Interviewers test whether you understand the distinction.

📖 Business Glossary

  • What: Approved definitions of business terms
  • Example: "Active user = logged in within 30 days"
  • Owner: Business stakeholders
  • Purpose: Align meaning across teams
  • Pain without: "My MRR ≠ your MRR"
  • Tools: Atlan, Alation, Collibra

🔩 Technical Metadata

  • What: Schemas, types, lineage, partitions
  • Example: "Column user_id: INT64, not null"
  • Owner: Data engineers
  • Purpose: Enable discovery and debugging
  • Pain without: "What table has revenue?"
  • Tools: DataHub, OpenMetadata, dbt
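
To make the two layers concrete, here is a minimal Python sketch that links one glossary term to the technical metadata that implements it. The class names, the table `analytics.dim_users`, the pipeline `daily_user_activity`, and the owner value are hypothetical; only the "active user" definition and the `user_id: INT64, not null` column come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    """Business layer: an approved definition owned by business stakeholders."""
    name: str
    definition: str
    owner: str


@dataclass(frozen=True)
class ColumnMetadata:
    """Technical layer: where and how the data physically lives."""
    table: str
    column: str
    dtype: str
    nullable: bool
    pipeline: str


# Business glossary: one approved definition per term.
GLOSSARY = {
    "active user": GlossaryTerm(
        name="active user",
        definition="Logged in within 30 days",
        owner="Product analytics",  # hypothetical business owner
    ),
}

# Link from glossary term to the columns that implement it (hypothetical names).
IMPLEMENTED_BY = {
    "active user": [
        ColumnMetadata("analytics.dim_users", "user_id", "INT64", False, "daily_user_activity"),
    ],
}


def describe(term_name: str) -> str:
    """Return the business meaning of a term plus where it is implemented."""
    term = GLOSSARY[term_name.lower()]
    columns = IMPLEMENTED_BY.get(term.name, [])
    locations = ", ".join(
        f"{c.table}.{c.column} ({c.dtype}, via {c.pipeline})" for c in columns
    )
    return f"{term.name}: {term.definition} -> {locations or 'no implementation linked'}"


print(describe("Active User"))
```

The glossary entry answers "what does this mean?", while the linked column records answer "where do I find it?". Keeping them as separate records with an explicit link mirrors the split between business and engineering ownership.
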

💡 Interview Gold

"A business glossary aligns meaning. Technical metadata enables implementation. You need both: the glossary says what 'net revenue' means, and the technical metadata tells you which table, column, and pipeline produces it."

📚 What Is a Data Catalog?

A data catalog is a searchable inventory of all datasets in your organization, enriched with metadata.

Think of it as "Google for your data." Without one, analysts can spend a large share of their time, by some estimates around 30%, just finding the right table.

What a catalog should contain:

Metadata Type | Example | Why It Matters
📝 Description | "Daily aggregated order metrics" | Know what data represents
👤 Owner | "Commerce team — Jane Doe" | Know who to ask questions
🔗 Lineage | "orders_raw → orders_clean → orders_daily" | Trace data flow
🏷️ Classification | "Contains PII: email, phone" | Know sensitivity level
⏰ Freshness | "Updated daily at 03:00 UTC" | Know if data is current
📊 Schema | "12 columns, partitioned by date" | Understand structure
📈 Popularity | "Queried 340 times/week" | Find trusted datasets

🎯 Interview Tip

Name real tools: DataHub (open-source, originally built at LinkedIn), Atlan (cloud-native), Alation (enterprise), OpenMetadata (open-source). Mentioning specific tools shows hands-on experience.
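
The metadata types in the catalog table above can be pictured as fields on a single record. The following is a rough Python sketch, not the data model of any particular catalog tool: the `CatalogEntry` class, its field names, and the timestamps are assumptions made for illustration, while the example values are taken from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CatalogEntry:
    """One dataset in the catalog, with one field per metadata type."""
    name: str
    description: str
    owner: str
    lineage: list[str]            # upstream-to-downstream chain of tables
    pii_columns: list[str]        # classification: columns holding PII
    last_updated: datetime        # freshness
    refresh_interval: timedelta   # how often an update is expected
    column_count: int             # schema summary
    partitioned_by: str
    queries_per_week: int         # popularity

    def contains_pii(self) -> bool:
        return bool(self.pii_columns)

    def is_stale(self, now: datetime) -> bool:
        """True when the last update is older than the expected interval."""
        return now - self.last_updated > self.refresh_interval


entry = CatalogEntry(
    name="orders_daily",
    description="Daily aggregated order metrics",
    owner="Commerce team (Jane Doe)",
    lineage=["orders_raw", "orders_clean", "orders_daily"],
    pii_columns=["email", "phone"],
    last_updated=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),  # illustrative date
    refresh_interval=timedelta(days=1),
    column_count=12,
    partitioned_by="date",
    queries_per_week=340,
)

# Two days after the last refresh, the entry should be flagged as stale.
print(entry.is_stale(datetime(2024, 1, 3, 9, 0, tzinfo=timezone.utc)))
```

A freshness check like `is_stale` is one way to surface entries whose metadata or data has drifted out of date instead of waiting for a user to notice.
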

🔗 Data Lineage & Impact Analysis

Lineage tracks how data flows from source to consumption. It answers: "Where did this number come from?"

Raw Events (Kafka topic) → Staging (stg_events) → Mart (fct_orders) → Dashboard (Revenue KPI)
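
The flow above can be modeled as a directed graph, which is enough to support both upstream tracing (root cause) and downstream impact analysis. This is a minimal sketch under stated assumptions: the node identifiers `kafka.raw_events` and `dashboard.revenue_kpi` are made-up labels for the Kafka topic and the dashboard, and real lineage tools typically extract these edges from query logs or pipeline definitions.

```python
from collections import deque

# Edges point from an upstream node to its direct downstream dependents.
LINEAGE = {
    "kafka.raw_events": ["stg_events"],
    "stg_events": ["fct_orders"],
    "fct_orders": ["dashboard.revenue_kpi"],
    "dashboard.revenue_kpi": [],
}


def _reachable(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every node reachable from start."""
    seen: list[str] = []
    queue = deque(edges.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(edges.get(node, []))
    return seen


def downstream(node: str, graph: dict[str, list[str]] = LINEAGE) -> list[str]:
    """Impact analysis: everything that could break if node changes."""
    return _reachable(node, graph)


def upstream(node: str, graph: dict[str, list[str]] = LINEAGE) -> list[str]:
    """Root cause: everything that feeds into node."""
    reversed_edges: dict[str, list[str]] = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_edges.setdefault(target, []).append(source)
    return _reachable(node, reversed_edges)


print(downstream("stg_events"))
print(upstream("dashboard.revenue_kpi"))
```

The same traversal works at column level if the node keys are `table.column` strings rather than table names; only the granularity of the edges changes.
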

Why lineage matters in interviews:

🔍 Root Cause

"Revenue dropped 20%" → trace lineage back to find the broken upstream pipeline.

💥 Impact Analysis

"If I change this column, what breaks?" Lineage shows all downstream dependents before you ship.

📋 Compliance

"Where does user email appear?" Lineage reveals every table that touches PII for GDPR requests.

💡 Interview Gold

"Column-level lineage is the gold standard. Table-level tells you which tables are connected. Column-level tells you exactly which fields flow where — critical for PII tracking and impact analysis."

🔎 Data Discovery & Self-Service

The goal of governance isn't to lock data down. It's to make the right data easy to find and safe to use.

🏅 Certification

Mark datasets as "certified" once they have been reviewed, confirmed accurate, and are actively maintained. Analysts prioritize certified sources.

🔍 Search

Full-text search across table names, column names, descriptions, and tags. "Find tables with customer email."

📊 Usage Signals

Show query frequency, who uses it, and which dashboards depend on it. Popular = probably trusted.
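
Certification, search, and usage signals tend to work together in a ranked search result. The sketch below is a simplified illustration, not how any specific catalog ranks results: the `Dataset` class, the sample datasets, and the ranking rule (certified first, then most queried) are all assumptions, and the matching is plain substring search.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    description: str
    columns: list[str]
    tags: list[str]
    certified: bool
    queries_per_week: int


# Hypothetical catalog contents for illustration.
CATALOG = [
    Dataset("customers_tmp_export", "Ad hoc export", ["customer_id", "email"], [], False, 900),
    Dataset("customers_clean", "Cleaned customer profiles", ["customer_id", "email"], ["pii"], True, 340),
    Dataset("orders_daily", "Daily aggregated order metrics", ["order_date", "customer_id"], [], True, 120),
]


def search(datasets: list[Dataset], query: str) -> list[Dataset]:
    """Match every query term against names, descriptions, columns, and tags."""
    terms = query.lower().split()

    def matches(ds: Dataset) -> bool:
        haystack = " ".join([ds.name, ds.description, *ds.columns, *ds.tags]).lower()
        return all(term in haystack for term in terms)

    hits = [ds for ds in datasets if matches(ds)]
    # Certified datasets rank first; ties are broken by query volume.
    return sorted(hits, key=lambda ds: (not ds.certified, -ds.queries_per_week))


for ds in search(CATALOG, "customer email"):
    print(ds.name, "certified" if ds.certified else "uncertified", ds.queries_per_week)
```

Ranking certified datasets above more popular but uncertified ones reflects the idea that popularity is a useful hint, while certification is the stronger trust signal.
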

🤖 Auto-Classification

Scanners, often ML-based, detect PII patterns (emails, SSNs, phone numbers) and auto-tag columns.
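
As a rough illustration of auto-tagging, the sketch below swaps the ML model for plain regular expressions, which is a much simpler technique than what production scanners use. The patterns, the 0.8 threshold, and the function name `classify_column` are assumptions chosen for the example and will miss many real-world formats.

```python
import re
from collections import Counter
from typing import Optional

# Checked in order, so a value shaped like an SSN is not counted as a phone number.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "phone": re.compile(r"\+?[\d(][\d\s().-]{7,}\d"),
}


def classify_column(samples: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return a PII tag if enough sampled values match one pattern, else None."""
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return None
    counts: Counter = Counter()
    for value in values:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.fullmatch(value):
                counts[tag] += 1
                break  # first matching pattern wins for this value
    if not counts:
        return None
    tag, hits = counts.most_common(1)[0]
    return tag if hits / len(values) >= threshold else None


print(classify_column(["a@example.com", "b@example.org", "c@example.net"]))
print(classify_column(["red", "blue", "green"]))
```

Sampling values and requiring a match ratio above a threshold keeps a single stray email in a free-text column from tagging the whole column as PII.
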

🎯 Interview Tip

"Self-service doesn't mean 'no rules.' It means guardrails that guide users to the right data with the right access. Think of it as a highway with lanes, not a locked gate."

Quiz: Test Yourself

Q1: A new dataset needs a classification decision (public vs confidential). Who makes the call?

Q2: You need to find every location where a customer's email address appears for a GDPR deletion. What helps most?

Q3: "Finance says 'revenue' is $10M, but marketing says $12M." Root cause?

Q4: A startup buys an expensive data catalog. Six months later, half the entries are stale. Best fix?

Q5: An engineer wants to rename a column in a staging table. What governance practice prevents breakage?