🔍 Identifying & Classifying PII / PHI

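PII (personally identifiable information) is any data that can identify a person, either on its own or combined with other data. PHI (protected health information) is individually identifiable health data protected under HIPAA.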

The first step in any governance program is knowing where sensitive data lives.

PII Classification Levels

🔴 Restricted

SSN, credit card numbers, passwords, biometric data. Breach = regulatory fines + lawsuits.

🟠 Confidential

Email, phone, address, date of birth, IP address. Requires access control and masking.

🔵 Internal

Employee ID, department, job title. Low risk but not public. Standard access controls.

🟢 Public

Company name, product catalog, published reports. No restrictions needed.
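
The four levels above can drive automated tagging of columns. The following is a minimal sketch, assuming classification by column name only; the `NAME_HINTS` table and the `classify_column` and `classify_table` helpers are hypothetical names chosen for illustration, and a real scanner would also sample column values, since name matching alone produces false positives and misses.

```python
from enum import IntEnum


class Level(IntEnum):
    """Classification levels from the list above; higher means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical name hints, chosen for illustration only.
NAME_HINTS = {
    Level.RESTRICTED: ("ssn", "card_number", "password", "biometric"),
    Level.CONFIDENTIAL: ("email", "phone", "address", "birth", "ip_address"),
    Level.INTERNAL: ("employee_id", "department", "job_title"),
}


def classify_column(column_name: str) -> Level:
    """Return the most sensitive level whose hint appears in the column name."""
    name = column_name.lower()
    for level in sorted(NAME_HINTS, reverse=True):  # check RESTRICTED first
        if any(hint in name for hint in NAME_HINTS[level]):
            return level
    return Level.PUBLIC


def classify_table(columns: list[str]) -> Level:
    """A table inherits the highest level found among its columns."""
    return max((classify_column(c) for c in columns), default=Level.PUBLIC)
```

Treating a table as being as sensitive as its most sensitive column is a conservative design choice: it errs toward over-protection until someone reviews the table by hand.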

💡 Interview Gold

"Classification is the prerequisite for every other control. You can't mask what you haven't found, you can't retain what you haven't classified, and you can't comply with GDPR if you don't know where PII lives."

🎭 Data Masking: Static vs Dynamic

Masking hides sensitive values while keeping data usable for analytics, testing, or development.

See it in action:

Field        | Original             | Masked
Email        | john.doe@acme.com    | j***@***.com
SSN          | 123-45-6789          | ***-**-6789
Credit Card  | 4111-1111-1111-1234  | ****-****-****-1234
Phone        | +1-555-867-5309      | +1-555-***-****
Name         | John Doe             | J*** D**
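
The masking rules behind the examples above can be sketched as a few small functions. This is an illustrative sketch, not a production masking library: the function names (`mask_digits`, `mask_email`, `mask_name`) are hypothetical, and the rules are inferred from the sample values only.

```python
def mask_digits(value: str, keep_first: int = 0, keep_last: int = 0) -> str:
    """Replace digits with '*', keeping the first/last N digits and all separators."""
    total = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            keep = seen <= keep_first or seen > total - keep_last
            out.append(ch if keep else "*")
        else:
            out.append(ch)  # separators such as '-' and '+' keep the format intact
    return "".join(out)


def mask_email(value: str) -> str:
    """Keep the first character of the local part and the top-level domain."""
    local, _, domain = value.partition("@")
    tld = domain.rsplit(".", 1)[-1]
    return f"{local[:1]}***@***.{tld}"


def mask_name(value: str) -> str:
    """Keep the first letter of each word and mask the rest."""
    return " ".join(w[:1] + "*" * (len(w) - 1) for w in value.split())
```

Calling `mask_digits` with `keep_last=4` covers the SSN and credit card rows, and `keep_first=4` covers the phone row. Keeping the separators is what preserves the original format.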

📋 Static Masking

Creates a permanent copy with masked values. Used for dev/test environments. In the copy, the original values are replaced and cannot be recovered.

⚡ Dynamic Masking

Masks at query time based on the user's role. No data copy needed. Production data stays intact. Different users see different views.
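
As a rough illustration of masking at read time, the sketch below applies a per-column policy based on the caller's role and leaves the stored row untouched. The roles, the `POLICY` table, and the `read_row` helper are all assumptions made for this example; in practice dynamic masking is usually configured in the database or query layer rather than written as application code.

```python
from typing import Callable

Row = dict[str, str]


def redact(value: str) -> str:
    """Hide every character; a real policy might keep the last four digits."""
    return "*" * len(value)


# Hypothetical policy: column -> (roles allowed to see clear text, mask function).
POLICY: dict[str, tuple[frozenset[str], Callable[[str], str]]] = {
    "ssn": (frozenset({"compliance"}), redact),
    "email": (frozenset({"compliance", "support"}), redact),
}


def read_row(row: Row, role: str) -> Row:
    """Return a role-specific view of the row without modifying the stored data."""
    view = {}
    for column, value in row.items():
        rule = POLICY.get(column)
        if rule is None or role in rule[0]:
            view[column] = value           # unrestricted column or authorized role
        else:
            view[column] = rule[1](value)  # mask on the way out
    return view
```

With this policy, a `compliance` user sees the full row, a `support` user sees the email but a redacted SSN, and any other role sees both columns redacted.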

🎯 Interview Tip

"Static masking is for non-production copies (dev, QA, analytics sandboxes). Dynamic masking is for production where the same table serves users with different access levels."

🔑 Tokenization vs Encryption vs Masking

Three techniques, three different purposes. Interviewers love asking "when would you use each?"

🎭 Masking

  • How: Replace chars with *
  • Reversible: No
  • Format: Preserved
  • Use case: Display, testing
  • Analytics: Limited (no joins)
  • Performance: Fast

🔑 Tokenization

  • How: Replace with random token
  • Reversible: Yes (with vault)
  • Format: Preserved or not
  • Use case: Analytics, payments
  • Analytics: Full (join on tokens)
  • Performance: Moderate
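
The tokenization properties listed above (random token, reversible only through the vault, joinable) can be sketched with an in-memory stand-in. The `TokenVault` class and the `tok_` prefix are hypothetical; a real vault is a hardened, access-controlled service, not a pair of dictionaries.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a token vault, for illustration only."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        """Return a stable random token for the value, creating one on first use."""
        token = self._token_by_value.get(value)
        if token is None:
            token = "tok_" + secrets.token_hex(8)  # random, not derived from the value
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only authorized callers should reach this."""
        return self._value_by_token[token]
```

Because the same input always maps to the same token, tables tokenized through one vault can still be joined on the token column, while the token itself reveals nothing about the original value.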

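🔐 Encryption

  • How: Key-based cryptographic transform
  • Reversible: Yes (with key)
  • Format: Not preserved (unless format-preserving encryption is used)
  • Use case: Storage, transit
  • Analytics: None on ciphertext (must decrypt first)
  • Performance: Slower (encrypt/decrypt overhead)

💡 Interview Gold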

"Use tokenization when analytics needs stable identifiers without raw PII (e.g., join user journeys across tables using tokens). Use encryption for data at rest and in transit. Use masking for display and non-production environments."

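🧪 Pseudonymization & Secure Sharing

Pseudonymization is a GDPR-recognized technique (defined in Article 4(5)). It replaces direct identifiers with pseudonyms so the data stays analytically useful, while the information needed to re-identify people is kept separately. Pseudonymized data still counts as personal data under GDPR.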

Raw PII (john@acme.com) → Pseudonymize (hash + salt) → Token: a7f3b2c (analytics-safe)
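
The flow above can be sketched in a few lines. Note one substitution: instead of a plain salted hash, this sketch uses HMAC-SHA256 with a secret key, a keyed variant of "hash + salt" that resists guessing common inputs as long as the key stays secret. The `pseudonymize` function, its `length` parameter, and the key are hypothetical, and the output will not match the illustrative token `a7f3b2c`.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    normalized = identifier.strip().lower()  # so 'John@ACME.com' matches 'john@acme.com'
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]  # truncated for readability; shorter output raises collision risk
```

The same identifier and key always give the same pseudonym, which is what makes joins and cohort analysis possible. Anyone holding the key can recompute pseudonyms for known identifiers, so the key needs the same protection as the raw data.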

Secure Data Sharing Patterns

🔗 Clean Rooms

Two companies match data without revealing raw records. Example: advertiser matches customer lists with publisher without sharing emails directly.
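
One simplified way to picture the matching step is for both parties to blind their emails with a shared secret key and compare only the blinded values. This is a toy sketch under that assumption, not how any particular clean-room product works; real clean rooms add a neutral execution environment and query restrictions. The `blind` and `overlap_count` helpers and the `min_size` threshold are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def blind(emails: list[str], shared_key: bytes) -> set[str]:
    """Replace each normalized email with a keyed hash before it leaves its owner."""
    return {
        hmac.new(shared_key, e.strip().lower().encode("utf-8"), hashlib.sha256).hexdigest()
        for e in emails
    }


def overlap_count(blinded_a: set[str], blinded_b: set[str], min_size: int = 2) -> Optional[int]:
    """Report only the size of the overlap, or nothing if it is below min_size."""
    size = len(blinded_a & blinded_b)
    return size if size >= min_size else None
```

Returning only a count, and suppressing very small overlaps, reflects the general idea that a clean room exposes results rather than records.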

📊 Aggregation

Share aggregated metrics (counts, averages) instead of row-level data. Applying k-anonymity, where every reported group contains at least k individuals, makes it much harder to re-identify anyone from a small group.
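
A minimal sketch of aggregation with small-group suppression is shown below. The `aggregate_with_suppression` helper and the default threshold of 5 are assumptions for illustration; suppression on its own does not stop every attack (for example, comparing two overlapping reports), so treat it as one control among several.

```python
from collections import Counter


def aggregate_with_suppression(
    rows: list[dict], group_by: tuple[str, ...], k: int = 5
) -> dict[tuple, int]:
    """Count rows per group and drop any group with fewer than k members."""
    counts = Counter(tuple(row[col] for col in group_by) for row in rows)
    return {group: n for group, n in counts.items() if n >= k}
```

For example, grouping by city and age band with `k=5` would report a group of five people but silently omit a group of two.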

🗃️ Tokenized Exports

Replace PII with tokens before exporting. The recipient can analyze patterns (user journeys, cohorts) without knowing identities.

🔒 Differential Privacy

Add calibrated random noise to query results so that the presence or absence of any single record can't be reliably inferred. Used by Apple, Google, and the US Census Bureau.
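
For a counting query, a standard way to add such noise is the Laplace mechanism: adding or removing one person changes a count by at most 1, so noise drawn from a Laplace distribution with scale 1/epsilon is added to the true count. The sketch below is for intuition only; the function names are hypothetical, and production systems should rely on a vetted differential privacy library, since naive floating-point sampling has known weaknesses.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query, whose sensitivity is 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon  # smaller epsilon means more noise and stronger privacy
    return true_count + laplace_noise(scale, rng)
```

Individual answers are noisy, but the noise averages out over many queries or large counts, which is why aggregate statistics remain useful.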

🎯 Interview Tip

"When asked about sharing data externally, start with: 'What's the minimum data needed?' Then choose: aggregation > tokenization > clean rooms > raw access. Each step up in granularity requires stronger controls."

Quiz: Test Yourself

Q1: Your analytics team needs to join user behavior across 5 tables without seeing real emails. Best approach?

Q2: Developers need a copy of production data for testing, but it contains SSNs. Best approach?

Q3: Two companies want to find overlapping customers without sharing raw PII. Solution?

Q4: You join a new company. Their data lake has 500 tables. First governance step for PII?

Q5: Why can't you simply encrypt PII columns and call it done for analytics?