Module 4: PII, Masking & Tokenization | Data Governance Interview Mastery

🔍 Identifying & Classifying PII / PHI

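PII (personally identifiable information) is any data that can identify a person, alone or combined with other data. PHI (protected health information) is health data tied to an individual and protected under HIPAA.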

The first step in any governance program is knowing where sensitive data lives.

PII Classification Levels

🔴 Restricted

SSN, credit card numbers, passwords, biometric data. Breach = regulatory fines + lawsuits.

🟠 Confidential

Email, phone, address, date of birth, IP address. Requires access control and masking.

🔵 Internal

Employee ID, department, job title. Low risk but not public. Standard access controls.

🟢 Public

Company name, product catalog, published reports. No restrictions needed.
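As a rough illustration of how discovery feeds classification, the sketch below scans free text with a few regular expressions and reports the highest level matched. The rule names, the patterns, and the "Unclassified" fallback are assumptions made for this example; real scanners combine patterns with column metadata, checksums, and human review.

```python
import re

# Hypothetical rules: (label, classification level, pattern).
# The patterns are deliberately simple and will miss or over-match real data.
RULES = [
    ("ssn", "Restricted", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("credit_card", "Restricted", re.compile(r"\b(?:\d{4}-){3}\d{4}\b")),
    ("email", "Confidential", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("ipv4", "Confidential", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]

# Levels from least to most sensitive, matching the four tiers above.
LEVEL_ORDER = ["Public", "Internal", "Confidential", "Restricted"]


def classify_value(text):
    """Return (level, labels): the highest level matched and the matching rule names."""
    matches = [(label, level) for label, level, pattern in RULES if pattern.search(text)]
    if not matches:
        # No pattern fired; this does NOT prove the value is public.
        return "Unclassified", []
    labels = [label for label, _ in matches]
    top_level = max((level for _, level in matches), key=LEVEL_ORDER.index)
    return top_level, labels


print(classify_value("SSN 123-45-6789, mail john.doe@acme.com"))
print(classify_value("Employee works in Finance"))
```

A value that matches nothing is reported as "Unclassified" rather than "Public", since a missed pattern is not evidence that the data is safe to publish.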

💡 Interview Gold

"Classification is the prerequisite for every other control. You can't mask what you haven't found, you can't retain what you haven't classified, and you can't comply with GDPR if you don't know where PII lives."

🎭 Data Masking: Static vs Dynamic

Masking hides sensitive values while keeping data usable for analytics, testing, or development.

See it in action:

Field	Original	Masked
Email	john.doe@acme.com	j*******@****.com
SSN	123-45-6789	***-**-6789
Credit Card	4111-1111-1111-1234	****-****-****-1234
Phone	+1-555-867-5309	+1-555-***-****
Name	John Doe	J*** D**
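The masking patterns in the table (keep the first character, or keep only the leading or trailing digits) can be sketched as a few small functions. The function names are hypothetical, and the choice to emit one asterisk per hidden character is an assumption of this example.

```python
def mask_email(email):
    """Keep the first character of the local part and the top-level domain."""
    local, _, domain = email.partition("@")
    name, dot, tld = domain.rpartition(".")
    return f"{local[0]}{'*' * (len(local) - 1)}@{'*' * len(name)}{dot}{tld}"


def mask_digits(value, keep_first=0, keep_last=0):
    """Replace digits with '*', keeping separators and the first/last N digits."""
    total = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            visible = seen <= keep_first or seen > total - keep_last
            out.append(ch if visible else "*")
        else:
            out.append(ch)
    return "".join(out)


def mask_name(name):
    """Keep the first letter of each word."""
    return " ".join(part[0] + "*" * (len(part) - 1) for part in name.split())


print(mask_email("john.doe@acme.com"))                    # j*******@****.com
print(mask_digits("123-45-6789", keep_last=4))            # ***-**-6789
print(mask_digits("4111-1111-1111-1234", keep_last=4))    # ****-****-****-1234
print(mask_digits("+1-555-867-5309", keep_first=4))       # +1-555-***-****
print(mask_name("John Doe"))                              # J*** D**
```

Keeping the separators and the value's length preserves the format, which is what lets masked data pass through UI layouts and test fixtures unchanged.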

📋 Static Masking

Creates a permanent copy with masked values. Used for dev/test environments. In the copy, the original values are overwritten and cannot be recovered.

⚡ Dynamic Masking

Masks at query time based on the user's role. No data copy needed. Production data stays intact. Different users see different views.
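A minimal sketch of the idea, assuming a hypothetical per-role policy of unmasked columns. In practice this logic usually lives in the database or warehouse as a masking policy; the role names, `POLICY` table, and `read_row` helper here are invented for illustration.

```python
# The stored row is never modified; masking happens on read.
ROW = {"name": "John Doe", "email": "john.doe@acme.com", "ssn": "123-45-6789"}

# Hypothetical policy: for each role, the columns it may see unmasked.
POLICY = {
    "support": {"name"},
    "analyst": set(),
    "compliance": {"name", "email", "ssn"},
}


def redact(value):
    """Hide every character of the value."""
    return "*" * len(value)


def read_row(row, role):
    """Return a masked view of the row for the given role."""
    allowed = POLICY.get(role, set())  # unknown roles see nothing unmasked
    return {col: (val if col in allowed else redact(val)) for col, val in row.items()}


print(read_row(ROW, "support"))
print(read_row(ROW, "compliance"))
```

Defaulting unknown roles to an empty allow-list means a missing policy entry fails closed rather than exposing raw values.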

🎯 Interview Tip

"Static masking is for non-production copies (dev, QA, analytics sandboxes). Dynamic masking is for production where the same table serves users with different access levels."

🔑 Tokenization vs Encryption vs Masking

Three techniques, three different purposes. Interviewers love asking "when would you use each?"

🎭 Masking

How: Replace chars with *
Reversible: No
Format: Preserved
Use case: Display, testing
Analytics: Limited (no joins)
Performance: Fast

🔑 Tokenization

How: Replace with random token
Reversible: Yes (with vault)
Format: Preserved or not
Use case: Analytics, payments
Analytics: Full (join on tokens)
Performance: Moderate

🔐 Encryption

How: Mathematical transform
Reversible: Yes (with key)
Format: Destroyed (unless format-preserving encryption is used)
Use case: Storage, transit
Analytics: None without decryption (ciphertext)
Performance: Slower

💡 Interview Gold

"Use tokenization when analytics needs stable identifiers without raw PII (e.g., join user journeys across tables using tokens). Use encryption for data at rest and in transit. Use masking for display and non-production environments."
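To make the "join on tokens" point concrete, here is a sketch of a token vault kept in memory. The `TokenVault` class and the `tok_` prefix are hypothetical; a production vault is a hardened, access-controlled service, not a Python dictionary.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a token vault (illustration only)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return a stable random token for the value, creating one if needed."""
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        """Reverse lookup; only callers with vault access can do this."""
        return self._value_by_token[token]


vault = TokenVault()
signups = [{"user": vault.tokenize("john@acme.com"), "plan": "pro"}]
clicks = [
    {"user": vault.tokenize("john@acme.com"), "page": "/pricing"},
    {"user": vault.tokenize("jane@acme.com"), "page": "/docs"},
]

# Join the two tables on the token; no raw email is needed.
plans = {row["user"]: row["plan"] for row in signups}
joined = [(c["page"], plans[c["user"]]) for c in clicks if c["user"] in plans]
print(joined)  # [('/pricing', 'pro')]
```

The token is random, so it reveals nothing about the email, yet the same email always maps to the same token, which is what keeps joins working.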

🧪 Pseudonymization & Secure Sharing

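Pseudonymization is a GDPR-recognized technique. It replaces direct identifiers with artificial ones but keeps the data analytically useful. Pseudonymized data still counts as personal data under GDPR, because it can be re-linked to a person using additional information.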

Raw PII (john@acme.com) → Pseudonymize (hash + salt) → Token: a7f3b2c (analytics-safe)
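The flow above can be sketched with Python's standard library. Note the swap: this example uses a keyed hash (HMAC-SHA256) with a secret key rather than a plain hash with a salt, on the assumption that the key is stored separately from the data. The `pseudonymize` function, the normalization step, and the 16-character truncation are choices made for this example.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key, length=16):
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    # Normalize so that 'John@Acme.com ' and 'john@acme.com' get the same pseudonym.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key; in practice it would come from a secrets manager.
key = b"example-key-kept-outside-the-dataset"

print(pseudonymize("john@acme.com", key))
print(pseudonymize(" John@Acme.com ", key))  # same pseudonym as the line above
```

Whoever holds the key can recompute the mapping for a known email, which is why the result is pseudonymous rather than anonymous.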

Secure Data Sharing Patterns

🔗 Clean Rooms

Two companies match data without revealing raw records. Example: advertiser matches customer lists with publisher without sharing emails directly.
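A heavily simplified sketch of the matching step: both parties blind their email lists with a keyed hash and only the overlap count is released. The shared key, the function names, and the sample lists are assumptions for this example. Real clean rooms add platform-enforced query controls or cryptographic protocols such as private set intersection, which this sketch does not implement.

```python
import hashlib
import hmac


def blind(emails, shared_key):
    """Replace each email with a keyed hash so raw addresses are not exchanged."""
    return {
        hmac.new(shared_key, e.strip().lower().encode("utf-8"), hashlib.sha256).hexdigest()
        for e in emails
    }


def overlap_size(advertiser_emails, publisher_emails, shared_key):
    """Return only the number of matching customers, not who they are."""
    return len(blind(advertiser_emails, shared_key) & blind(publisher_emails, shared_key))


# Hypothetical inputs for illustration.
advertiser = ["john@acme.com", "jane@acme.com"]
publisher = ["JANE@acme.com", "sam@example.org"]
print(overlap_size(advertiser, publisher, b"key-agreed-out-of-band"))  # 1
```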

📊 Aggregation

Share aggregated metrics (counts, averages) instead of row-level data. K-anonymity requires every released group to contain at least k individuals, so that no one can be singled out from a small group.
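One common way to apply this is small-group suppression: count rows per group and drop any group below a threshold k. The sketch below shows only that step, not a full k-anonymity algorithm (there is no generalization of values); the `aggregate_counts` name and the sample rows are hypothetical.

```python
from collections import Counter


def aggregate_counts(rows, group_keys, k=5):
    """Count rows per group and suppress groups with fewer than k members."""
    counts = Counter(tuple(row[key] for key in group_keys) for row in rows)
    return {group: n for group, n in counts.items() if n >= k}


# Hypothetical data: six people in one group, two in another.
rows = [{"city": "Austin", "age_band": "30-39"}] * 6 + [
    {"city": "Boise", "age_band": "60-69"}
] * 2

print(aggregate_counts(rows, ["city", "age_band"], k=5))
# {('Austin', '30-39'): 6}
```

The two-person Boise group is withheld because publishing it would narrow the data down to very few individuals.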

🗃️ Tokenized Exports

Replace PII with tokens before exporting. The recipient can analyze patterns (user journeys, cohorts) without knowing identities.

🔒 Differential Privacy

Add calibrated random noise to query results so that the presence or absence of any individual record can't be inferred. Used by Apple, Google, and the US Census Bureau.
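As an illustration, the sketch below applies the Laplace mechanism to a count query. A count changes by at most 1 when one person is added or removed (sensitivity 1), so the noise scale is 1/epsilon. The function names are hypothetical, and this toy version ignores issues a production library has to handle, such as privacy budget tracking and floating-point attacks.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # expovariate takes the rate (1 / mean), so each draw has mean `scale`.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng=random):
    """Return the count plus Laplace noise with scale 1 / epsilon."""
    return true_count + laplace_noise(1 / epsilon, rng)


# Smaller epsilon means more noise and stronger privacy.
print(private_count(100, epsilon=1.0))
print(private_count(100, epsilon=0.1))
```

Because the noise is centered on zero, averages over many noisy answers stay close to the truth, while any single answer gives little evidence about one individual.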