📋 GDPR "Right to Be Forgotten"

Article 17 of GDPR gives individuals the right to request deletion of their personal data, subject to exceptions such as legal retention obligations. It is one of the hardest governance problems for data engineers.

The challenge: data exists in raw, staging, serving, and backup layers. Deleting from one isn't enough.

Implementation strategies:

Deletion Request → Find All PII (column lineage) → Delete / Anonymize (per layer) → Verify & Log (audit trail)
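
The workflow above can be sketched as a small orchestrator. This is a minimal illustration under stated assumptions, not a reference implementation: `LayerStore`, `erase_user`, and the in-memory rows are hypothetical stand-ins for real storage clients, and the audit entries record a request ID rather than the user's identity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LayerStore:
    """Hypothetical stand-in for one storage layer (raw, staging, serving)."""
    name: str
    rows: list[dict] = field(default_factory=list)

    def delete_user(self, user_id: str) -> int:
        """Remove the user's rows and return how many were deleted."""
        before = len(self.rows)
        self.rows = [r for r in self.rows if r.get("user_id") != user_id]
        return before - len(self.rows)

    def count_user(self, user_id: str) -> int:
        """Count rows that still reference the user (used for verification)."""
        return sum(1 for r in self.rows if r.get("user_id") == user_id)


def erase_user(request_id: str, user_id: str, layers: list[LayerStore]) -> list[dict]:
    """Delete a user from every layer, verify, and return audit entries."""
    audit = []
    for layer in layers:
        deleted = layer.delete_user(user_id)
        remaining = layer.count_user(user_id)  # verify step
        audit.append({
            "request_id": request_id,  # log the request, not the person
            "layer": layer.name,
            "rows_deleted": deleted,
            "verified": remaining == 0,
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return audit
```

In a real pipeline each `LayerStore` would wrap a warehouse, lake, or backup client, and the list of layers to visit would come from column lineage rather than being passed in by hand.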

🗑️ Hard Delete

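Actually remove the rows. Works in databases and Delta Lake (DELETE ... WHERE). In Delta Lake the deleted rows stay in older data files until VACUUM removes those files, so the delete is not physically complete until then. Expensive for large datasets stored in immutable files, because every file that contains the user's rows must be rewritten.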

🔒 Crypto Shredding

Encrypt PII with per-user keys. To "forget" a user, destroy their key. The encrypted data becomes unreadable without rewriting any data files, provided no copy of the key or of the plaintext survives elsewhere.
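
A minimal sketch of the idea, using only the Python standard library. Everything here is an assumption made for illustration: `CryptoShredder` is a hypothetical class, the in-memory dictionary stands in for a key management service, and the SHA-256 keystream is a toy cipher chosen so the example runs without dependencies. A real system would use an authenticated cipher such as AES-GCM from a vetted library.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 over key + nonce + counter). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


class CryptoShredder:
    """Hypothetical per-user key store with encrypt, decrypt, and forget."""

    def __init__(self) -> None:
        self._keys: dict[str, bytes] = {}  # stand-in for a KMS or key table

    def encrypt(self, user_id: str, plaintext: str) -> bytes:
        """Encrypt with the user's key, creating the key on first use."""
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        data = plaintext.encode("utf-8")
        stream = _keystream(key, nonce, len(data))
        return nonce + bytes(a ^ b for a, b in zip(data, stream))

    def decrypt(self, user_id: str, blob: bytes) -> str:
        """Decrypt a blob; fails once the user's key has been destroyed."""
        key = self._keys.get(user_id)
        if key is None:
            raise KeyError(f"key for {user_id} was destroyed")
        nonce, body = blob[:16], blob[16:]
        stream = _keystream(key, nonce, len(body))
        return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")

    def forget(self, user_id: str) -> None:
        """Destroy the key; stored ciphertexts for this user stay in place."""
        self._keys.pop(user_id, None)
```

The design point is that `forget` touches only the key store. The ciphertext in raw files, serving tables, and backups is left where it is, which is why this approach suits large immutable datasets.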

🧩 Separation Pattern

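Store PII in a separate table linked by user_id. Delete only the user's row in the PII table. Analytical data remains intact and can no longer be linked to the person, provided user_id is a surrogate key and the remaining columns do not identify the user on their own.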
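
The pattern can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the table names `user_pii` and `page_views`, the sample rows, and the `forget_user` helper are all hypothetical, and the `user_id` values are surrogate keys that carry no meaning by themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_pii (user_id TEXT PRIMARY KEY, email TEXT, full_name TEXT);
    CREATE TABLE page_views (user_id TEXT, page TEXT);
    INSERT INTO user_pii VALUES ('k_001', 'alice@example.com', 'Alice A.');
    INSERT INTO user_pii VALUES ('k_002', 'bob@example.com', 'Bob B.');
    INSERT INTO page_views VALUES ('k_001', '/home');
    INSERT INTO page_views VALUES ('k_001', '/pricing');
    INSERT INTO page_views VALUES ('k_002', '/home');
""")


def forget_user(conn: sqlite3.Connection, user_id: str) -> int:
    """Delete the user's row from the PII table only; return rows removed."""
    cur = conn.execute("DELETE FROM user_pii WHERE user_id = ?", (user_id,))
    conn.commit()
    return cur.rowcount


removed = forget_user(conn, "k_001")

# Aggregates over the analytics table still work after the deletion.
views_per_key = conn.execute(
    "SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id ORDER BY user_id"
).fetchall()

# The lookup back to identity now finds nothing for the forgotten key.
identity = conn.execute(
    "SELECT email FROM user_pii WHERE user_id = ?", ("k_001",)
).fetchall()
```

Only one small table is mutated, so the large analytics tables never need to be rewritten.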

💡 Interview Gold

"The separation pattern is the most practical for lakehouses: store personal data in a small, mutable table. When a GDPR request arrives, delete from that table only. Analytics tables that reference user_id are then de-identified, as long as user_id is a surrogate key and no other column can re-identify the person."

⏰ Data Retention Policies

Retention policies define how long data is kept and when it must be deleted or archived.

Every regulation has different requirements. You must balance legal minimums with business needs.

📊 By Regulation

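GDPR: no longer than necessary for the stated purpose. SOX: 7 years for audit-related financial records. HIPAA: 6 years for compliance documentation such as policies and procedures (retention of the medical records themselves is set mostly by state law). PCI-DSS: at least 1 year of audit log history.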

🔄 By Data Tier

Raw: 90 days. Staging: 30 days. Serving: per business need. Backups: 30-90 days. These are typical starting points, not fixed rules; each tier gets its own retention window.

⚙️ Enforcement Methods

Partition expiration (BigQuery), lifecycle rules (S3), scheduled delete jobs, VACUUM with a retention window (Delta Lake), TTL policies.
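
A scheduled delete job first has to decide which partitions are past retention. The sketch below shows that step only. The `RETENTION_DAYS` table and the `expired_partitions` helper are hypothetical; the raw and staging values repeat the examples in the text, and the serving value is a made-up business choice.

```python
from datetime import date, timedelta

# Illustrative retention windows in days, one per data tier.
RETENTION_DAYS = {"raw": 90, "staging": 30, "serving": 365}


def expired_partitions(tier: str, partitions: list[date], today: date) -> list[date]:
    """Return partition dates that are older than the tier's retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS[tier])
    return sorted(p for p in partitions if p < cutoff)
```

Keeping the windows in one table makes the policy reviewable and lets the same job enforce every tier. A fuller version would also key the table by regulation, so that records under a longer legal retention period are never picked up by a shorter tier default.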

Enforcement pipeline:

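-- BigQuery: set partition expiration (partitions are dropped after 90 days)
ALTER TABLE raw.events
SET OPTIONS (
  partition_expiration_days = 90
);

-- Delta Lake: VACUUM physically removes data files that the table no longer
-- references and that are older than the retention threshold (720 hours = 30 days).
-- It does not delete current rows, so pair it with a scheduled DELETE.
VACUUM events RETAIN 720 HOURS;

🎯 Interview Tip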

"Retention isn't just 'keep for X days.' It's a policy per data tier per regulation. Raw data may need short retention (GDPR minimization), while financial summaries need 7+ years (SOX). Automate enforcement — manual processes always drift."

📝 Audit Logging & Access Trails

Auditing answers: "Who accessed what, when, and from where?" It's the backbone of compliance proof.

What to log:

🔍 Query Access

Who ran what query, on which tables, at what time. Include the SQL text, rows scanned, and result size.

🔑 Permission Changes

Who granted/revoked what access, to whom, and when. Track role assignments and policy modifications.

📊 Data Modifications

Inserts, updates, deletes on sensitive tables. Critical for detecting unauthorized changes.

👤 Admin Actions

Schema changes, configuration updates, user creation. Everything an admin does should be logged.

🚨 Failed Attempts

Failed logins, permission-denied queries, blocked exports. These are security signals.
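
Whatever the event type, audit entries are only useful as evidence if tampering can be detected. One common way to get there is a hash chain, where each entry's hash covers the previous entry. That technique is an addition for illustration, not something the categories above prescribe; `append_event`, `verify_chain`, and the event fields are hypothetical names.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event's content."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> dict:
    """Append an audit event linked to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    }
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; an edited entry, or one removed from the middle,
    breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects changes but does not prevent them, and it cannot see truncation at the end of the log. It complements write-once storage rather than replacing it.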

💡 Interview Gold

"Audit logs must be immutable, queryable, and retained per policy. Store them separately from operational data. Use write-once storage (for example S3 with Object Lock, or GCS with a bucket retention policy) and make them searchable for investigations. Never let the audited system control its own audit logs."

🌍 Cross-Border Transfers & Consent

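GDPR restricts transferring personal data out of the European Economic Area (EEA) unless the destination country has an adequacy decision or another safeguard is in place. This affects any company that moves EU residents' data across regions.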

🌍 Transfer Mechanisms

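SCCs (Standard Contractual Clauses): legal agreements between sender and receiver. They became the most common mechanism after Privacy Shield was invalidated in 2020 (the Schrems II ruling). Since 2023, US companies certified under the EU-US Data Privacy Framework can rely on its adequacy decision instead.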

🏢 Data Residency

Keep EU data in EU regions. Cloud providers offer region-specific storage. Some countries mandate local storage (Russia, China).

✅ Consent Management

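Consent is one of several lawful bases for processing under GDPR. When you rely on it, consent must be freely given, specific, informed, and unambiguous (explicit for special categories of data), and it must be as easy to withdraw as to give. Track consent status per user per purpose.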
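
Checking consent per user per purpose can be sketched as below. The field names follow the consent data model shown later in this section; the `ConsentRecord` class, the `has_active_consent` helper, and the rule that the most recent record wins are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ConsentRecord:
    user_id: str
    purpose: str                      # e.g. 'marketing', 'analytics'
    consented: bool
    granted_at: datetime
    revoked_at: Optional[datetime] = None  # None if still active
    source: str = "web_form"


def has_active_consent(
    records: list[ConsentRecord], user_id: str, purpose: str, at: datetime
) -> bool:
    """True if the latest record for (user, purpose) grants consent at `at`."""
    matching = [
        r for r in records
        if r.user_id == user_id and r.purpose == purpose and r.granted_at <= at
    ]
    if not matching:
        return False  # no record means no consent
    latest = max(matching, key=lambda r: r.granted_at)
    revoked = latest.revoked_at is not None and latest.revoked_at <= at
    return latest.consented and not revoked
```

Defaulting to `False` when no record exists keeps the check safe: a pipeline that filters on this function skips users whose consent was never captured.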

🤖 Compliance Automation

Automate PII scanning, consent tracking, retention enforcement, and audit reporting. Manual compliance does not scale as the number of datasets and teams grows.
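
As a small illustration of automated PII scanning, the sketch below flags columns whose sampled values look like email addresses. The `columns_with_emails` helper and the deliberately simple regular expression are assumptions for the example; real scanners combine many detectors with column names and lineage metadata.

```python
import re

# Deliberately simple pattern; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def columns_with_emails(rows: list[dict]) -> set[str]:
    """Return names of columns where any sampled value looks like an email."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and EMAIL_RE.search(value):
                flagged.add(column)
    return flagged
```

Running a scan like this on a sample of each new table catches PII that lands in free-text columns, which column naming conventions alone would miss.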

Consent data model:

-- Track consent per user per purpose
CREATE TABLE consent_records (
  user_id    STRING,
  purpose    STRING,     -- 'marketing', 'analytics', 'personalization'
  consented  BOOLEAN,
  granted_at TIMESTAMP,
  revoked_at TIMESTAMP,  -- NULL if still active
  source     STRING      -- 'web_form', 'mobile_app', 'email'
);

🎯 Interview Tip

"When asked about cross-border data: mention SCCs as the legal mechanism, data residency as the technical control, and pseudonymization as a risk reduction technique. GDPR treats pseudonymization as a safeguard, but pseudonymized data is still personal data as long as it can be re-linked to a person."

Quiz: Test Yourself

Q1: Your Delta Lake has 500TB of historical data. A GDPR deletion request arrives. Most practical approach?

Q2: What is "crypto shredding" in the context of GDPR?

Q3: An engineer suggests storing audit logs in the same database being audited. Problem?

Q4: Your US-based company processes EU customer data. Legal mechanism for the transfer?

Q5: GDPR says minimize retention. SOX says keep financial records 7 years. What do you do?