Article 17 of the GDPR gives individuals the right to request erasure of their personal data (the "right to be forgotten"). It is among the hardest governance problems for data engineers.
The challenge: the same personal data exists in raw, staging, serving, and backup layers. Deleting it from one layer isn't enough.
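Hard delete: actually remove the rows. Works in databases and Delta Lake (DELETE ... WHERE), although in Delta Lake the deleted rows stay in older data files, reachable through time travel, until VACUUM removes those files. Expensive for large immutable datasets.
Crypto-shredding: encrypt PII with per-user keys. To "forget" a user, destroy their key. The ciphertext becomes unreadable everywhere it was copied, with no need to rewrite the datasets that contain it.
Separation: store PII in a separate table linked by user_id. Delete only the user's row in the PII table. Analytical data remains intact and is de-identified, as long as the remaining columns cannot re-identify the person.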
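A minimal sketch of the per-user-key approach (often called crypto-shredding). Everything in it is an assumption for illustration: the UserKeyStore class, its method names, and the in-memory key dictionary are hypothetical, and the SHAKE-256 keystream is only a stand-in so the example runs on the standard library. A real system would use an authenticated cipher such as AES-GCM from a vetted library and keep the keys in a key management service.

```python
import hashlib
import secrets
from typing import Dict


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Stand-in stream cipher for illustration only: XOR with a SHAKE-256 keystream.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class UserKeyStore:
    """Hypothetical in-memory key store holding one random key per user."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def encrypt(self, user_id: str, plaintext: str) -> bytes:
        # Create the user's key on first use, then encrypt with a fresh nonce.
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        return nonce + _keystream_xor(key, nonce, plaintext.encode("utf-8"))

    def decrypt(self, user_id: str, blob: bytes) -> str:
        key = self._keys.get(user_id)
        if key is None:
            raise KeyError(f"no key for {user_id}: the data is unrecoverable")
        nonce, body = blob[:16], blob[16:]
        return _keystream_xor(key, nonce, body).decode("utf-8")

    def forget(self, user_id: str) -> None:
        # Destroying the key "deletes" every copy of the ciphertext at once.
        self._keys.pop(user_id, None)


store = UserKeyStore()
blob = store.encrypt("u1", "alice@example.com")  # blob may live in any layer
print(store.decrypt("u1", blob))
store.forget("u1")  # after this, decrypt("u1", blob) raises KeyError
```

The ciphertext in raw, staging, and backup layers is never touched. The pattern only holds if every copy of the key, including key backups, is destroyed.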
"The separation pattern is the most practical for lakehouses: store personal data in a small, mutable table. When GDPR hits, delete from that table only. Analytics tables referencing user_id become anonymized automatically."
Retention policies define how long data is kept and when it must be deleted or archived.
Requirements differ by regulation: some set a minimum retention period, while GDPR limits how long you may keep personal data. You must balance these legal constraints with business needs.
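GDPR: no longer than necessary for the purpose of processing. SOX: 7 years for audit-related financial records. HIPAA: 6 years for compliance documentation such as policies and procedures (retention of medical records themselves is set mainly by state law). PCI-DSS: at least 12 months of audit logs, with the latest 3 months immediately available.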
Example values by tier. Raw: 90 days. Staging: 30 days. Serving: per business need. Backups: 30-90 days. Each tier has its own retention period.
Enforcement mechanisms: partition expiration (BigQuery), lifecycle rules (S3), scheduled delete jobs, VACUUM with file retention settings (Delta Lake), snapshot expiration (Apache Iceberg), and TTL policies.
"Retention isn't just 'keep for X days.' It's a policy per data tier per regulation. Raw data may need short retention (GDPR minimization), while financial summaries need 7+ years (SOX). Automate enforcement — manual processes always drift."
Auditing answers: "Who accessed what, when, and from where?" It's the backbone of compliance proof.
Query logs: who ran what query, on which tables, at what time. Include the SQL text, rows scanned, and result size.
Access changes: who granted or revoked what access, to whom, and when. Track role assignments and policy modifications.
Data modifications: inserts, updates, and deletes on sensitive tables. Critical for detecting unauthorized changes.
Admin actions: schema changes, configuration updates, user creation. Everything an admin does should be logged.
Failed attempts: failed logins, permission-denied queries, blocked exports. These are security signals.
"Audit logs must be immutable, queryable, and retained per policy. Store them separately from operational data. Use append-only storage (S3, GCS) and make them searchable for investigations. Never let the audited system control its own audit logs."
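GDPR restricts transfers of personal data from the EU/EEA to other countries unless the destination has an adequacy decision or an approved safeguard is in place. This affects most companies that operate globally.
SCCs (Standard Contractual Clauses): standard contract terms between the data exporter and the data importer. They became the most common transfer mechanism after the Privacy Shield was invalidated in 2020.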
Keep EU data in EU regions. Cloud providers offer region-specific storage. Some countries mandate local storage (Russia, China).
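GDPR does not require consent for all processing, but when consent is the legal basis it must be freely given, specific, informed, and unambiguous, and explicit for special categories such as health data. Consent must be as easy to withdraw as to give. Track consent status per user per purpose.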
Automate PII scanning, consent tracking, retention enforcement, and audit reporting. Manual compliance stops scaling quickly as teams and datasets grow.
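Consent tracking per user per purpose can be modeled as an append-only ledger in which the latest decision wins. This is a sketch under assumptions: ConsentLedger and its methods are hypothetical, and timestamps are assumed to be ISO-8601 UTC strings so they sort correctly as text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ConsentLedger:
    """Hypothetical append-only record of consent decisions."""

    events: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def record(self, at: str, user_id: str, purpose: str, granted: bool) -> None:
        # Never overwrite: withdrawals are new events, which keeps the history.
        self.events.append((at, user_id, purpose, granted))

    def is_allowed(self, user_id: str, purpose: str) -> bool:
        allowed = False  # no recorded decision means no consent
        for at, uid, prp, granted in sorted(self.events, key=lambda e: e[0]):
            if uid == user_id and prp == purpose:
                allowed = granted  # the most recent decision wins
        return allowed


ledger = ConsentLedger()
ledger.record("2024-01-01T00:00:00Z", "u1", "marketing", True)
ledger.record("2024-03-01T00:00:00Z", "u1", "marketing", False)  # withdrawal
print(ledger.is_allowed("u1", "marketing"))  # False
```

Keeping the full history, instead of a single current flag, also shows what consent was in force at the time of any past processing.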
"When asked about cross-border data: mention SCCs as the legal mechanism, data residency as the technical control, and pseudonymization as a risk reduction technique. GDPR allows pseudonymized data more flexibility than raw PII."