Module 2: Roles, Ownership & Catalogs | Data Governance Interview Mastery

👥 Roles in Data Governance

Every governance program fails without clear accountability. "Nobody owns it" is the #1 governance failure mode.

Three core roles appear in every interview. Know who does what.

👑 Data Owner

Typically: VP/Director of the business domain

Accountable for the dataset's business definition and quality
Approves who gets access and at what level
Decides classification (public, internal, confidential, restricted)
Signs off on retention and deletion policies

🔧 Data Steward

Typically: Senior analyst or data lead in the domain

Responsible for day-to-day metadata and quality
Maintains business glossary entries and documentation
Monitors data quality metrics and resolves issues
Coordinates between business and engineering

⚙️ Data Custodian

Typically: Data engineer or platform engineer

Operates the technical platform
Implements access controls, encryption, and backups
Manages storage, compute, and pipeline infrastructure
Executes retention policies (partition expiry, lifecycle rules)

🎯 Interview Tip

The owner says "what," the steward says "how well," and the custodian says "how technically." If asked who approves a new dataset, it's the owner. If asked who fixes a broken schema, it's the custodian.
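The split of responsibilities can be sketched as a small lookup table. This is a minimal illustration, not a standard: the task names (`approve_access`, `fix_broken_schema`, and so on) are hypothetical labels for the duties listed under each role.

```python
from enum import Enum


class Role(Enum):
    OWNER = "Data Owner"
    STEWARD = "Data Steward"
    CUSTODIAN = "Data Custodian"


# Hypothetical task names, mapped to the role that handles them
# according to the role descriptions in this module.
RESPONSIBILITIES = {
    "approve_access": Role.OWNER,
    "set_classification": Role.OWNER,
    "sign_off_retention": Role.OWNER,
    "maintain_glossary": Role.STEWARD,
    "monitor_quality": Role.STEWARD,
    "implement_access_controls": Role.CUSTODIAN,
    "execute_retention": Role.CUSTODIAN,
    "fix_broken_schema": Role.CUSTODIAN,
}


def who_handles(task: str) -> Role:
    """Return the role responsible for a task, or fail loudly if nobody owns it."""
    try:
        return RESPONSIBILITIES[task]
    except KeyError:
        raise ValueError(f"No role assigned for task {task!r}") from None


print(who_handles("set_classification").value)  # Data Owner
print(who_handles("fix_broken_schema").value)   # Data Custodian
```

Raising an error for an unmapped task is deliberate: an unassigned task is exactly the "nobody owns it" failure mode.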

📖 Business Glossary vs Technical Metadata

These are two different layers of "data about data." Interviewers test whether you understand the distinction.

📖 Business Glossary

What: Approved definitions of business terms
Example: "Active user = logged in within 30 days"
Owner: Business stakeholders
Purpose: Align meaning across teams
Pain without: "My MRR ≠ your MRR"
Tools: Atlan, Alation, Collibra

🔩 Technical Metadata

What: Schemas, types, lineage, partitions
Example: "Column user_id: INT64, not null"
Owner: Data engineers
Purpose: Enable discovery and debugging
Pain without: "What table has revenue?"
Tools: DataHub, OpenMetadata, dbt

💡 Interview Gold

"A business glossary aligns meaning. Technical metadata enables implementation. You need both: the glossary says what 'net revenue' means, and the technical metadata tells you which table, column, and pipeline produces it."
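One way to picture how the two layers connect is a glossary term that points at the columns implementing it. The sketch below is an assumption-laden illustration: the class names and the `dim_users` table are hypothetical, and real catalogs model this link in their own ways.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ColumnRef:
    """Technical metadata: where a value physically lives."""
    table: str
    column: str
    data_type: str
    nullable: bool


@dataclass
class GlossaryTerm:
    """Business glossary: the agreed meaning of a term."""
    name: str
    definition: str
    owner: str
    implemented_by: list[ColumnRef] = field(default_factory=list)


active_user = GlossaryTerm(
    name="Active user",
    definition="Logged in within 30 days",
    owner="Business stakeholders",
)
# Hypothetical table name; the column spec mirrors the example above.
active_user.implemented_by.append(
    ColumnRef(table="dim_users", column="user_id", data_type="INT64", nullable=False)
)


def where_is(term: GlossaryTerm) -> list[str]:
    """List the physical columns behind a business term."""
    return [f"{c.table}.{c.column} ({c.data_type})" for c in term.implemented_by]


print(where_is(active_user))  # ['dim_users.user_id (INT64)']
```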

📚 What Is a Data Catalog?

A data catalog is a searchable inventory of all datasets in your organization, enriched with metadata.

Think of it as "Google for your data." Without one, analysts spend 30% of their time just finding the right table.

What a catalog should contain:

Metadata Type	Example	Why It Matters
📝 Description	"Daily aggregated order metrics"	Know what data represents
👤 Owner	"Commerce team — Jane Doe"	Know who to ask questions
🔗 Lineage	"orders_raw → orders_clean → orders_daily"	Trace data flow
🏷️ Classification	"Contains PII: email, phone"	Know sensitivity level
⏰ Freshness	"Updated daily at 03:00 UTC"	Know if data is current
📊 Schema	"12 columns, partitioned by date"	Understand structure
📈 Popularity	"Queried 340 times/week"	Find trusted datasets
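The table above maps naturally onto a record type. The following is a rough sketch, assuming a hypothetical `CatalogEntry` class and a hypothetical rule that every field except popularity is required; it shows how a catalog could flag incomplete entries.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str = ""
    owner: str = ""
    upstream: list[str] = field(default_factory=list)  # lineage
    classification: str = ""
    freshness: str = ""
    schema_summary: str = ""
    queries_per_week: int = 0  # popularity


# Assumption: these fields must be filled before an entry counts as complete.
REQUIRED = ("description", "owner", "upstream", "classification",
            "freshness", "schema_summary")


def missing_metadata(entry: CatalogEntry) -> list[str]:
    """Return the names of required fields that are still empty."""
    return [name for name in REQUIRED if not getattr(entry, name)]


orders_daily = CatalogEntry(
    name="orders_daily",
    description="Daily aggregated order metrics",
    owner="Commerce team, Jane Doe",
    upstream=["orders_raw", "orders_clean"],
    classification="Contains PII: email, phone",
    freshness="Updated daily at 03:00 UTC",
    schema_summary="12 columns, partitioned by date",
    queries_per_week=340,
)

print(missing_metadata(orders_daily))  # []
print(missing_metadata(CatalogEntry(name="tmp_export", description="Ad hoc export")))
```

A check like this is one way to keep entries from going stale silently: incomplete entries can be reported to the owner or steward rather than left in the catalog unnoticed.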

🎯 Interview Tip

Name real tools: DataHub (open-source, originally built at LinkedIn), Atlan (cloud-native), Alation (enterprise), OpenMetadata (open-source). Mentioning specific tools shows hands-on experience.

🔗 Data Lineage & Impact Analysis

Lineage tracks how data flows from source to consumption. It answers: "Where did this number come from?"

Raw Events (Kafka topic) → Staging (stg_events) → Mart (fct_orders) → Dashboard (Revenue KPI)

Why lineage matters in interviews:

🔍 Root Cause

"Revenue dropped 20%" → trace lineage back to find the broken upstream pipeline.

💥 Impact Analysis

"If I change this column, what breaks?" Lineage shows all downstream dependents before you ship.

📋 Compliance

"Where does user email appear?" Lineage reveals every table that touches PII for GDPR requests.
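Root-cause and impact analysis are both graph traversals over lineage edges, just in opposite directions. Here is a minimal sketch using breadth-first search. The node names `raw_events` and `revenue_kpi_dashboard` are assumed labels for the Kafka topic and the dashboard; `stg_events` and `fct_orders` come from the flow described in this section.

```python
from collections import deque

# Each edge reads "source feeds destination".
EDGES = [
    ("raw_events", "stg_events"),
    ("stg_events", "fct_orders"),
    ("fct_orders", "revenue_kpi_dashboard"),
]


def _adjacency(edges, reverse=False):
    """Build a node -> neighbors map, optionally with edges flipped."""
    graph = {}
    for src, dst in edges:
        a, b = (dst, src) if reverse else (src, dst)
        graph.setdefault(a, []).append(b)
    return graph


def _reachable(start, edges, reverse=False):
    """Breadth-first walk returning every node reachable from start."""
    graph = _adjacency(edges, reverse)
    seen, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def downstream(node, edges=EDGES):
    """Impact analysis: what could break if this node changes?"""
    return _reachable(node, edges)


def upstream(node, edges=EDGES):
    """Root cause: which sources feed this node?"""
    return _reachable(node, edges, reverse=True)


print(downstream("stg_events"))           # ['fct_orders', 'revenue_kpi_dashboard']
print(upstream("revenue_kpi_dashboard"))  # ['fct_orders', 'stg_events', 'raw_events']
```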

💡 Interview Gold

"Column-level lineage is the gold standard. Table-level tells you which tables are connected. Column-level tells you exactly which fields flow where — critical for PII tracking and impact analysis."
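To see why column-level lineage is more precise, consider a sketch where edges connect (table, column) pairs instead of tables. All table and column names below are hypothetical. At table level, `fct_orders` is downstream of `raw_events`; at column level, it is clear that the email field never reaches it.

```python
# Each edge reads "(table, column) feeds (table, column)". Names are hypothetical.
COLUMN_EDGES = [
    (("raw_events", "email"), ("stg_events", "user_email")),
    (("stg_events", "user_email"), ("dim_customers", "email")),
    (("raw_events", "order_total"), ("stg_events", "order_total")),
    (("stg_events", "order_total"), ("fct_orders", "revenue")),
]


def columns_derived_from(source, edges=COLUMN_EDGES):
    """Follow column-level edges forward from a source column."""
    found, seen, frontier = [], {source}, [source]
    while frontier:
        current = frontier.pop()
        for src, dst in edges:
            if src == current and dst not in seen:
                seen.add(dst)
                found.append(dst)
                frontier.append(dst)
    return found


def tables_touching(source, edges=COLUMN_EDGES):
    """Tables holding the source column or anything derived from it."""
    derived_tables = {table for table, _ in columns_derived_from(source, edges)}
    return sorted({source[0]} | derived_tables)


print(tables_touching(("raw_events", "email")))
# ['dim_customers', 'raw_events', 'stg_events']
```

For a GDPR request, this narrows the search to three tables and leaves out `fct_orders`, which a table-level graph would have included.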

🔎 Data Discovery & Self-Service

The goal of governance isn't to lock data down. It's to make the right data easy to find and safe to use.

🏅 Certification

Mark datasets as "certified," meaning reviewed, accurate, and maintained. Analysts prioritize certified sources.

🔍 Search

Full-text search across table names, column names, descriptions, and tags. "Find tables with customer email."

📊 Usage Signals

Show query frequency, who uses it, and which dashboards depend on it. Popular = probably trusted.
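Certification, search, and usage signals can work together in one ranking. The sketch below is a toy version with hypothetical dataset names and a hypothetical ordering rule (certified first, then by query frequency). Real catalogs use proper search indexes rather than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str
    columns: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    certified: bool = False
    queries_per_week: int = 0


def _searchable_text(d: Dataset) -> str:
    return " ".join([d.name, d.description, *d.columns, *d.tags]).lower()


def search(datasets, query):
    """Match every query term as a substring, then rank certified and popular first."""
    terms = query.lower().split()
    hits = [d for d in datasets if all(t in _searchable_text(d) for t in terms)]
    return sorted(hits, key=lambda d: (not d.certified, -d.queries_per_week, d.name))


DATASETS = [
    Dataset("tmp_customer_export", "Ad hoc export", ["customer_id", "email"],
            queries_per_week=500),
    Dataset("dim_customers", "Customer dimension", ["customer_id", "email"],
            tags=["pii"], certified=True, queries_per_week=340),
    Dataset("fct_orders", "Order facts", ["order_id", "revenue"],
            certified=True, queries_per_week=900),
]

print([d.name for d in search(DATASETS, "customer email")])
# ['dim_customers', 'tmp_customer_export']
```

Putting certification ahead of raw popularity is a design choice: a heavily queried scratch table should not outrank the reviewed source.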

🤖 Auto-Classification

Pattern- and ML-based scanners detect PII (emails, SSNs, phone numbers) and auto-tag columns.
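As a simplified stand-in for a production scanner, the idea can be sketched with regular expressions over sampled column values. This is not an ML classifier: the patterns are deliberately loose, and the 80% match threshold and the column names are assumptions for illustration.

```python
import re

# Loose illustrative patterns; checked in this order, so SSN wins over phone.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{8,}\d$"),
}


def classify_column(samples, threshold=0.8):
    """Return a PII tag if enough non-empty samples match one pattern, else None."""
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return None
    for tag, pattern in PII_PATTERNS.items():
        matches = sum(1 for v in values if pattern.match(v))
        if matches / len(values) >= threshold:
            return tag
    return None


def auto_tag(table_samples):
    """Map column name -> PII tag for every column that looks sensitive."""
    tags = {}
    for column, samples in table_samples.items():
        tag = classify_column(samples)
        if tag is not None:
            tags[column] = tag
    return tags


SAMPLES = {
    "contact": ["jane@example.com", "bob@example.org"],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "mobile": ["+1 415 555 0100", "415-555-0199"],
    "order_id": ["1001", "1002"],
}

print(auto_tag(SAMPLES))
# {'contact': 'email', 'tax_id': 'ssn', 'mobile': 'phone'}
```

Sampling values instead of trusting column names catches PII hidden behind vague names such as `contact`. Tags produced this way are usually treated as suggestions for a steward to confirm.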

🎯 Interview Tip

"Self-service doesn't mean 'no rules.' It means guardrails that guide users to the right data with the right access. Think of it as a highway with lanes, not a locked gate."

Quiz: Test Yourself

Q1: A new dataset needs a classification decision (public vs confidential). Who makes the call?

Q2: You need to find every location where a customer's email address appears for a GDPR deletion. What helps most?

Q3: "Finance says 'revenue' is $10M, but marketing says $12M." Root cause?

Q4: A startup buys an expensive data catalog. Six months later, half the entries are stale. Best fix?

Q5: An engineer wants to rename a column in a staging table. What governance practice prevents breakage?