🔥 The Interview Gauntlet

You've learned the theory. Now face the hardest scenario questions interviewers actually ask. Each one is designed to trip you up.

⚠️ Warning

These questions don't have simple answers. Interviewers want to see your thought process, not a memorized response. Talk through tradeoffs.

Scenario 1: The GDPR Fire Drill

"It's Friday 4pm. Legal just received a GDPR deletion request for 50,000 EU users. The 30-day deadline starts now. Your lakehouse has raw, staging, serving, and ML training layers. Walk me through your response."

1. Acknowledge scope: all layers where PII exists. Use column-level lineage to map them.
2. Prioritize: delete from serving/ML layers first (actively consumed).
3. For raw/immutable layers: use the separation pattern or crypto shredding.
4. Document what was deleted, what was anonymized, and retain proof.
5. Verify with a post-deletion PII scan.
6. Update the consent management system.
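
Crypto shredding, mentioned in step 3, can be sketched roughly as follows. This is an illustrative keystream cipher only (use AES-GCM from a real crypto library in production), and the in-memory `key_vault` stands in for a proper key-management service:

```python
import hashlib
import secrets

# Hypothetical key vault: user_id -> per-user key. In crypto shredding,
# deleting the key makes the ciphertext permanently unrecoverable, so
# immutable raw-layer files never need to be rewritten.
key_vault = {}

def _keystream(key: bytes, length: int) -> bytes:
    # Illustrative SHA-256 counter-mode keystream; NOT production crypto.
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_for_user(user_id: str, plaintext: bytes) -> bytes:
    key = key_vault.setdefault(user_id, secrets.token_bytes(32))
    ks = _keystream(key, len(plaintext))
    return bytes(a ^ b for a, b in zip(plaintext, ks))

def decrypt_for_user(user_id: str, ciphertext: bytes) -> bytes:
    if user_id not in key_vault:
        raise KeyError("key shredded - data is unrecoverable")
    ks = _keystream(key_vault[user_id], len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, ks))

def shred(user_id: str) -> None:
    # GDPR deletion: drop the key; the raw layer can stay immutable.
    key_vault.pop(user_id, None)
```

The point to make in an interview: deletion becomes a key-management operation, so the 30-day deadline applies to a small key table rather than petabytes of raw data.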

Scenario 2: The Data Catalog Nobody Uses

"We spent $500K on a data catalog. Six months later, 20% of entries are accurate. Engineers bypass it entirely. How do you fix this?"

1. Root cause: no ownership model. Nobody is accountable.
2. Assign data owners and stewards per domain. Make it a job responsibility, not a favor.
3. Automate what you can: schema crawling, lineage extraction, freshness monitoring.
4. Integrate the catalog into workflows (dbt docs → catalog, IDE search → catalog).
5. Gamify: show usage stats, reward certified datasets.
6. Start small: certify the top 20 most-queried datasets first.
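
The freshness monitoring in step 3 is the easiest automation win. A minimal sketch, assuming the catalog exposes last-load timestamps per table (field names here are hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Flag catalog entries whose backing table hasn't loaded recently, so
# stale entries get marked automatically instead of waiting on humans.
def stale_tables(last_loaded: dict, max_age: timedelta,
                 now: Optional[datetime] = None) -> list:
    now = now or datetime.now(timezone.utc)
    return sorted(t for t, ts in last_loaded.items() if now - ts > max_age)
```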

Scenario 3: The Shadow Data Pipeline

"A marketing analyst built their own pipeline copying customer data to a personal S3 bucket for analysis. You just found out. What do you do?"

1. Don't panic or blame — this is a governance gap, not malice.
2. Immediately assess: what data was copied? Does it contain PII? Who has access to that S3 bucket?
3. Secure the bucket: encrypt, restrict access, audit.
4. Delete or migrate the data to a governed location.
5. Fix the root cause: why did the analyst go rogue? Probably because the governed path was too slow.
6. Create a self-service sandbox with proper guardrails.
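
The PII assessment in step 2 can start with a crude pattern scan. A sketch with two illustrative patterns (real scans would use a dedicated tool such as Amazon Macie or Microsoft Presidio, and these regexes will miss edge cases):

```python
import re

# Illustrative PII detectors; deliberately simple, not exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_text(text: str) -> dict:
    # Return {pattern_name: matches} for every pattern that fired.
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}
```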

🎯 More Scenario Questions

Scenario 4: The Metric Disagreement

"The CFO's revenue number doesn't match the VP of Sales' number. Both claim their dashboard is correct. You're brought in to investigate. What's your approach?"

1. Trace lineage for both dashboards back to source tables.
2. Compare definitions: "revenue" likely means different things (gross vs net, booked vs recognized, including returns or not).
3. Check the business glossary — if no entry exists, that's your root cause.
4. Align on a single certified definition with both stakeholders.
5. Create a governed semantic layer (metric store) as the single source of truth.
6. Deprecate the conflicting dashboards.
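
Step 2 is easy to demonstrate concretely: two defensible definitions of "revenue" computed from the same rows give different numbers. A sketch with hypothetical field names:

```python
# Same source data, two plausible metric definitions.
orders = [
    {"amount": 100.0, "returned": False},
    {"amount": 50.0, "returned": True},
]

def gross_revenue(rows):
    # Perhaps the VP of Sales' dashboard: everything booked.
    return sum(o["amount"] for o in rows)

def net_revenue(rows):
    # Perhaps the CFO's dashboard: excludes returns.
    return sum(o["amount"] for o in rows if not o["returned"])
```

Both numbers are "correct" for their definition; the governance fix is to certify one definition per metric name in the glossary and semantic layer.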

Scenario 5: The Multi-Region Problem

"Your company just expanded to the EU. All data currently lives in us-east-1. EU customers are signing up. What changes?"

1. GDPR applies to EU resident data regardless of where you process it.
2. Ideal: create an EU data region (eu-west-1). Route EU user data to EU storage.
3. For cross-region analytics: use pseudonymized/aggregated data transfers, backed by SCCs.
4. Update consent forms for EU users.
5. Implement GDPR deletion capabilities (Article 17).
6. Set up separate retention policies for EU vs US data.
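
The routing in step 2 usually reduces to a residency lookup at write time. A trivial sketch (the region names match the scenario; the country set is abbreviated and illustrative):

```python
# Route each user's data to a home region by residency.
EU_COUNTRIES = {"DE", "FR", "IE", "NL", "ES", "IT"}  # abbreviated list

def home_region(country_code: str) -> str:
    # EU residents -> EU storage; everyone else stays in us-east-1.
    return "eu-west-1" if country_code.upper() in EU_COUNTRIES else "us-east-1"
```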

Scenario 6: The Overprivileged Service Account

"You discover that a service account used by the ETL pipeline has ADMIN access to the entire data warehouse. It's been this way for 2 years. Now what?"

1. Don't revoke immediately — you'll break production.
2. Audit: what does this account actually access? Check query logs for the past 90 days.
3. Create a least-privilege role that covers exactly what the ETL pipeline needs.
4. Test the new role in staging.
5. Switch production to the new role with a rollback plan.
6. Set up alerts for any service account with ADMIN privileges.
7. Add access reviews to prevent this from happening again.
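
Steps 2 and 3 can be mechanized: mine the query log for what the account reads and writes, then emit the matching grants. A sketch with naive regex parsing and a hypothetical role name (real warehouses expose structured access history you should query instead):

```python
import re

def required_grants(query_log, role="etl_rw"):
    # Derive a least-privilege grant set from observed queries.
    reads, writes = set(), set()
    for q in query_log:
        for t in re.findall(r"(?:FROM|JOIN)\s+(\w+\.\w+)", q, re.I):
            reads.add(t)
        for t in re.findall(r"(?:INSERT INTO|UPDATE|MERGE INTO)\s+(\w+\.\w+)", q, re.I):
            writes.add(t)
    grants = [f"GRANT SELECT ON {t} TO ROLE {role};" for t in sorted(reads)]
    grants += [f"GRANT INSERT, UPDATE ON {t} TO ROLE {role};" for t in sorted(writes)]
    return grants
```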

Scenario 7: The ML Model and PII

"Your ML team trained a recommendation model on customer purchase data that includes names and emails. The model is in production. A GDPR request arrives. Do you need to retrain the model?"

1. It depends on whether the model memorized PII. Most ML models don't store raw PII in weights, but training data lineage matters.
2. Delete the user from the training data.
3. If the model is a simple statistical model, retraining may not be required (the individual's contribution is negligible).
4. If the model memorizes data (LLMs, nearest-neighbor), retrain without the user's data.
5. Going forward: train models on tokenized/pseudonymized data to avoid this entirely.
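
The pseudonymization in step 5 can be as simple as keyed hashing of identifiers before they reach the feature store. A sketch; in practice the secret lives in a secrets manager, not in code:

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical; fetch from a secrets manager

def pseudonymize(email: str) -> str:
    # Keyed HMAC so the token is stable per user but can't be reversed
    # without the key. Deleting/rotating the key severs the linkage.
    return hmac.new(SECRET, email.lower().encode(), hashlib.sha256).hexdigest()[:16]
```

Training on tokens like these means a deletion request touches the stored training data and key material, not the model weights.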

⚡ Rapid-Fire Q&A

Quick questions, quick answers. Know these cold.

Q: What's the difference between data quality and data governance?

Governance is the framework (who, what, why). Quality is one pillar within governance (is the data accurate, complete, timely?).

Q: Name three tools for data cataloging.

DataHub (open-source, LinkedIn), Atlan (cloud-native), Alation (enterprise). Bonus: OpenMetadata, Apache Atlas, Collibra.

Q: What's the maximum GDPR fine?

4% of global annual revenue or €20M, whichever is greater. Meta was fined €1.2B in 2023.

Q: What's a data contract?

An agreement between a data producer and consumer defining schema, quality expectations, SLAs, and ownership. Like an API contract but for data.
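
A data contract becomes useful when it's machine-checkable. A minimal sketch: the producer declares schema and quality rules, and validation rejects violating records before they propagate. Contract contents here are hypothetical:

```python
# Hypothetical contract for an "orders" feed.
CONTRACT = {
    "schema": {"order_id": str, "amount": float, "country": str},
    "checks": {"amount": lambda v: v >= 0, "country": lambda v: len(v) == 2},
}

def violations(record: dict, contract: dict = CONTRACT) -> list:
    # Return a list of contract violations; empty list means the record passes.
    errs = []
    for field, typ in contract["schema"].items():
        if field not in record:
            errs.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errs.append(f"wrong type: {field}")
    for field, check in contract["checks"].items():
        if field in record and not check(record[field]):
            errs.append(f"failed check: {field}")
    return errs
```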

Q: How do you handle PII in a data lake?

Classify on ingestion → tag → mask/tokenize → enforce column-level security → audit access. Never let raw PII reach the analytics layer.
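
The mask step in that pipeline can be sketched as a per-row transform applied at ingestion, so analysts only ever see masked values (column names are hypothetical; real stacks would do this with warehouse masking policies):

```python
def mask_email(email: str) -> str:
    # Keep first character and domain for debuggability; hide the rest.
    local, _, domain = email.partition("@")
    return (local[0] + "***@" + domain) if local and domain else "***"

def mask_row(row: dict, pii_cols=("email",)) -> dict:
    # Apply masking only to tagged PII columns; pass everything else through.
    return {k: mask_email(v) if k in pii_cols else v for k, v in row.items()}
```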

Q: What's the difference between anonymization and pseudonymization?

Anonymization is irreversible — data cannot be re-identified. Pseudonymization is reversible with additional info. GDPR considers pseudonymized data still personal data.

Q: What is "data mesh" and how does it affect governance?

Data mesh decentralizes data ownership to domain teams. Governance becomes federated: global policies + domain-level autonomy. The governance team sets standards; domains implement them.

Q: What's a data classification scheme?

A taxonomy for labeling data sensitivity: Public → Internal → Confidential → Restricted. Each level has different access controls, retention, and handling rules.
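
A classification scheme only bites when it's enforceable config rather than a wiki page. A sketch using the four levels above; the handling rules attached to each level are illustrative defaults:

```python
# Ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "restricted"]

# Each level maps to handling rules that tooling can enforce.
HANDLING = {
    "public":       {"encryption": False, "approval_needed": False},
    "internal":     {"encryption": True,  "approval_needed": False},
    "confidential": {"encryption": True,  "approval_needed": True},
    "restricted":   {"encryption": True,  "approval_needed": True},
}

def at_least(level: str, minimum: str) -> bool:
    # True if `level` is at least as sensitive as `minimum`.
    return LEVELS.index(level) >= LEVELS.index(minimum)
```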

🏆 Final Gauntlet Quiz

Q1: You're hired as the first data governance lead. 200 tables, no catalog, no documentation. First 30 days?

Q2: Your company adopts data mesh. What happens to central governance?

Q3: Is pseudonymized data considered "personal data" under GDPR?

Q4: An auditor asks: "How do you ensure nobody can tamper with access logs?" Best answer?

Q5: A VP says "Governance will slow us down." How do you respond?

🎓 Common Pitfalls & Gotchas

❌ Pitfall: "Governance = buying a tool"

Tools are enablers, not solutions. A $500K catalog is useless without ownership, processes, and cultural buy-in.

❌ Pitfall: "Boil the ocean" approach

Don't try to govern 500 tables on day one. Start with the top 20 most-queried datasets. Expand incrementally.

❌ Pitfall: "Security = governance"

Locking everything down isn't governance. Governance enables access — it's about the right people getting the right data safely.

⚠️ Gotcha: Anonymization vs pseudonymization

Anonymized data = out of GDPR scope. Pseudonymized data = still personal data under GDPR. Many candidates get this wrong.

⚠️ Gotcha: "We encrypt everything, so we're compliant"

Encryption at rest doesn't protect against authorized insiders querying data. You still need access controls, masking, and audit trails.

⚠️ Gotcha: Retention ≠ deletion

Setting a retention policy doesn't mean data actually gets deleted. You need automated enforcement — partition expiry, lifecycle rules, scheduled jobs — and verification.
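
A sketch of what enforcement plus verification looks like, assuming date-named partitions (the verification step is the part teams usually skip):

```python
from datetime import date, timedelta

def enforce_retention(partitions: set, retention_days: int,
                      today: date) -> set:
    # Drop partitions older than the retention window, then verify.
    cutoff = today - timedelta(days=retention_days)
    kept = {p for p in partitions if date.fromisoformat(p) >= cutoff}
    # Verification: retention isn't "done" until this check holds.
    assert all(date.fromisoformat(p) >= cutoff for p in kept)
    return kept
```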

🏆
You've completed the Data Governance Interview Gauntlet!
Review any modules you found challenging. Good luck!