PII (personally identifiable information) is any data that can identify a person. PHI (protected health information) is individually identifiable health data protected under HIPAA.
The first step in any governance program is knowing where sensitive data lives.
SSN, credit card numbers, passwords, biometric data. Breach = regulatory fines + lawsuits.
Email, phone, address, date of birth, IP address. Requires access control and masking.
Employee ID, department, job title. Low risk but not public. Standard access controls.
Company name, product catalog, published reports. No restrictions needed.
"Classification is the prerequisite for every other control. You can't mask what you haven't found, you can't retain what you haven't classified, and you can't comply with GDPR if you don't know where PII lives."
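As a rough illustration of "finding" sensitive data, the sketch below samples column values and matches them against a few regular expressions. The patterns, the tier names (`restricted`, `confidential`), and the 80% threshold are all assumptions made for this example; real discovery tools combine many more signals, such as column names, checksums, and dictionaries.

```python
import re

# Hypothetical detectors and tier names; tune both for your own data.
# Order matters: more specific patterns (SSN) are checked before looser ones (phone).
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\-\s]{8,}\d$"),
}
TIER = {"ssn": "restricted", "email": "confidential", "phone": "confidential"}


def classify_column(values, threshold=0.8):
    """Return (label, tier) when at least `threshold` of non-empty values match a pattern."""
    samples = [v for v in values if v]
    if not samples:
        return None, "unclassified"
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if hits / len(samples) >= threshold:
            return label, TIER[label]
    return None, "unclassified"


def scan_table(columns):
    """Classify every column of a table given as {column_name: [values]}."""
    return {name: classify_column(vals) for name, vals in columns.items()}


if __name__ == "__main__":
    table = {
        "tax_id": ["123-45-6789", "987-65-4321"],
        "contact": ["john.doe@acme.com", ""],
        "dept": ["Engineering", "Sales"],
    }
    for column, result in scan_table(table).items():
        print(column, result)
```

Matching on values rather than column names catches sensitive data hiding in generically named fields, at the cost of false positives that a reviewer still needs to confirm.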
Masking hides sensitive values while keeping data usable for analytics, testing, or development.
| Field | Original | Masked |
|---|---|---|
| Email | john.doe@acme.com | j***@***.com |
| SSN | 123-45-6789 | ***-**-6789 |
| Credit Card | 4111-1111-1111-1234 | ****-****-****-1234 |
| Phone | +1-555-867-5309 | +1-555-***-**** |
| Name | John Doe | J*** D** |
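The masking patterns in the table can be sketched as a few small functions. The function names are hypothetical, and the code assumes well-formed input (a non-empty local part in the email, hyphen-separated phone groups); it is meant to show the shape of the rules, not to serve as a production masker.

```python
def mask_email(email):
    """Keep the first character and the top-level domain, mask the rest."""
    local, _, domain = email.partition("@")
    tld = domain.rsplit(".", 1)[-1]
    return f"{local[0]}***@***.{tld}"


def mask_keep_last4(value):
    """Mask every digit except the last four, leaving separators in place."""
    total_digits = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - 4 else "*")
        else:
            out.append(ch)
    return "".join(out)


def mask_phone(phone, keep_groups=2):
    """Keep the leading groups (country and area code), mask the remaining groups."""
    parts = phone.split("-")
    return "-".join(
        part if i < keep_groups else "*" * len(part)
        for i, part in enumerate(parts)
    )


def mask_name(name):
    """Keep the first letter of each word, mask the remaining letters."""
    return " ".join(word[0] + "*" * (len(word) - 1) for word in name.split())


if __name__ == "__main__":
    print(mask_email("john.doe@acme.com"))
    print(mask_keep_last4("123-45-6789"))
    print(mask_keep_last4("4111-1111-1111-1234"))
    print(mask_phone("+1-555-867-5309"))
    print(mask_name("John Doe"))
```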
Creates a permanent copy with masked values. Used for dev/test environments. The original values in the copy are overwritten, so the masking cannot be reversed.
Masks at query time based on the user's role. No data copy needed. Production data stays intact. Different users see different views.
"Static masking is for non-production copies (dev, QA, analytics sandboxes). Dynamic masking is for production where the same table serves users with different access levels."
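To make the dynamic case concrete, here is a minimal in-memory sketch of role-based masking applied at read time. The roles, the `POLICY` mapping, and the rule names (`last4`, `hide`) are invented for illustration; in practice a database or query layer would normally enforce this rather than application code.

```python
# Hypothetical policy: role -> {column: masking rule}. Columns not listed are shown as-is.
POLICY = {
    "admin": {},
    "analyst": {"ssn": "last4", "email": "hide"},
    "support": {"ssn": "hide"},
}


def apply_rule(rule, value):
    """Apply one masking rule to a single string value."""
    if rule == "last4":
        head, tail = value[:-4], value[-4:]
        return "".join("*" if ch.isalnum() else ch for ch in head) + tail
    if rule == "hide":
        return "****"
    raise ValueError(f"unknown masking rule: {rule}")


def read_row(row, role):
    """Return a masked view of `row` for `role`; the stored row is never modified."""
    rules = POLICY.get(role)
    if rules is None:
        # Deny by default: an unknown role gets no data at all.
        raise PermissionError(f"role {role!r} has no access policy")
    return {
        col: apply_rule(rules[col], val) if col in rules else val
        for col, val in row.items()
    }


if __name__ == "__main__":
    stored = {"name": "John Doe", "ssn": "123-45-6789", "email": "john.doe@acme.com"}
    for role in ("admin", "analyst", "support"):
        print(role, read_row(stored, role))
```

The same stored row yields three different views, which is the property that separates dynamic masking from a static masked copy.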
Three techniques, three different purposes. Interviewers love asking "when would you use each?"
"Use tokenization when analytics needs stable identifiers without raw PII (e.g., join user journeys across tables using tokens). Use encryption for data at rest and in transit. Use masking for display and non-production environments."
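The "stable identifiers" point can be illustrated with a keyed hash: the same input always yields the same token, so tables can be joined on the token without carrying the raw email. This is one possible approach, assumed here for brevity; vault-based tokenization, which stores a lookup table and allows authorized detokenization, is another. The key, the `tok_` prefix, and the 16-character truncation are illustrative choices only.

```python
import hashlib
import hmac

# Assumption: a demo key. A real key would come from a secrets manager and be rotated.
SECRET_KEY = b"demo-key-do-not-use-in-production"


def tokenize(value, key=SECRET_KEY):
    """Return a deterministic token for `value` using HMAC-SHA256."""
    normalized = value.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def tokenize_rows(rows, field="email"):
    """Replace `field` in each row with a `user_token` column."""
    return [
        {**{k: v for k, v in row.items() if k != field}, "user_token": tokenize(row[field])}
        for row in rows
    ]


def join_on_token(left, right):
    """Inner-join two tokenized row lists on `user_token`."""
    index = {}
    for row in right:
        index.setdefault(row["user_token"], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l["user_token"], [])]


if __name__ == "__main__":
    signups = tokenize_rows([{"email": "John.Doe@acme.com", "plan": "pro"}])
    purchases = tokenize_rows([{"email": "john.doe@acme.com", "amount": 40}])
    print(join_on_token(signups, purchases))
```

Without the key, a token cannot be recomputed from a guessed email, which is why a keyed HMAC is used here instead of a plain hash.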
Pseudonymization is a GDPR-recognized technique. It replaces identifiers but keeps the data analytically useful.
Two companies match data without revealing raw records. Example: advertiser matches customer lists with publisher without sharing emails directly.
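Share aggregated metrics (counts, averages) instead of row-level data. K-anonymity requires every combination of quasi-identifiers (such as ZIP code and age band) to be shared by at least k people, so no individual can be singled out from a small group.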
Replace PII with tokens before exporting. The recipient can analyze patterns (user journeys, cohorts) without knowing identities.
Add calibrated random noise to query results so individual records can't be inferred. Used by Apple, Google, and the US Census Bureau.
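For the noise-based approach, a minimal sketch of the Laplace mechanism on a counting query looks like this. It assumes a sensitivity of 1 (adding or removing one person changes the count by at most 1), so the noise scale is `1 / epsilon`. The function names are hypothetical, and the sketch ignores privacy-budget tracking across repeated queries.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng=random):
    """Return a count with Laplace noise; assumes the query has sensitivity 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Smaller epsilon means more noise and stronger privacy.
    for eps in (0.1, 1.0, 10.0):
        print(eps, round(noisy_count(1000, eps), 2))
```

Each released answer is slightly wrong, but averages over many people stay useful, which is the trade the technique makes.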
"When asked about sharing data externally, start with: 'What's the minimum data needed?' Then choose: aggregation > tokenization > clean rooms > raw access. Each step up in granularity requires stronger controls."
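The aggregation option can be sketched as a group count that drops any group smaller than k before export. This is small-group suppression, a simple companion to k-anonymity rather than a full k-anonymization algorithm (which would also generalize values). The function name, the column names, and the default `k=5` are assumptions for the example.

```python
from collections import Counter


def aggregate_with_suppression(rows, quasi_identifiers, k=5):
    """Count rows per quasi-identifier combination and drop groups smaller than k."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {key: count for key, count in groups.items() if count >= k}


if __name__ == "__main__":
    rows = (
        [{"zip": "94107", "age_band": "30-39"}] * 6
        + [{"zip": "94107", "age_band": "80-89"}] * 1  # one person: would be identifiable
    )
    print(aggregate_with_suppression(rows, ["zip", "age_band"], k=5))
```

The single 80-89 record is withheld because publishing a group of one would effectively publish that person's row.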