GDPR text anonymization — compliance checklist for AI tooling
Anonymization is one of GDPR's most powerful exemptions. Truly anonymous data falls outside the regulation entirely — you can process it freely, store it indefinitely, and share it without restriction. But GDPR sets a high bar: anonymization must be irreversible. Pseudonymization (replacing names with tokens you can reverse) is explicitly not anonymization under GDPR.
This matters enormously for AI teams. If you redact PII before sending text to an LLM API, you need to understand whether that redaction is anonymization or pseudonymization — and the answer affects your legal obligations significantly.
Anonymization vs pseudonymization: the legal distinction
The Article 29 Working Party (now EDPB) Opinion 05/2014 defines anonymization as a process that "irreversibly prevents identification." Concretely:
- Anonymization: Replace "Maria Rossi" with ████ and discard the mapping. No one can recover the original name. GDPR no longer applies to this text.
- Pseudonymization: Replace "Maria Rossi" with [PERSON_1] and keep a mapping somewhere. You (or an attacker with the mapping) can re-identify. GDPR still applies — you just have a reduced risk profile.
Most PII redaction workflows used with LLMs are pseudonymization by this definition, because teams keep the original for later re-insertion. That's fine — it's a legitimate GDPR safeguard — but don't claim it's "anonymized data" in your privacy policy.
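The distinction is easy to see in code. The following is an illustrative Python sketch, not any library's actual implementation: the function names are hypothetical, and entity detection (the `entities` list) is assumed to have happened upstream.

```python
def pseudonymize(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each detected entity with a token and KEEP the mapping.
    This is pseudonymization: the mapping allows re-identification,
    so GDPR still applies to the output."""
    mapping: dict[str, str] = {}
    for i, entity in enumerate(entities, start=1):
        token = f"[PERSON_{i}]"
        mapping[token] = entity
        text = text.replace(entity, token)
    return text, mapping

def anonymize(text: str, entities: list[str]) -> str:
    """Replace each detected entity and DISCARD any mapping.
    Only if no one (including you) can recover the originals does this
    approach GDPR anonymization, and only if the remaining text also
    passes the inference test described below."""
    for entity in entities:
        text = text.replace(entity, "████")
    return text

redacted, mapping = pseudonymize("Call Maria Rossi tomorrow", ["Maria Rossi"])
# redacted is "Call [PERSON_1] tomorrow"; keeping `mapping` is what makes
# this pseudonymization rather than anonymization
```

The code difference is one line (keep or discard the mapping), but the legal difference is whether GDPR applies to the output at all.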
The three anonymization tests (EDPB)
For a dataset to be considered anonymous, it must pass all three:
- Singling out: Can you isolate an individual in the dataset? (e.g., "the person with SSN 123-45-6789")
- Linkability: Can you link two records across datasets to identify the same person?
- Inference: Can you infer sensitive attributes about an individual from remaining data?
PII redaction of names and contact details usually passes the singling-out and linkability tests but may fail the inference test if sensitive attributes (health condition, salary, location history) remain in the text.
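For structured records accompanying the text, the singling-out test can be approximated mechanically with a k-anonymity-style check. This is a minimal sketch under assumed inputs; the function name and record shape are illustrative, not part of any standard API.

```python
from collections import Counter

def singles_out(records: list[dict], quasi_identifiers: list[str]) -> bool:
    """Return True if any combination of quasi-identifier values occurs
    exactly once, i.e. some individual can still be singled out (failing
    the EDPB singling-out test). Every combination should be shared by
    at least two records (k >= 2) to pass this rough check."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return any(count == 1 for count in combos.values())

records = [
    {"zip": "20100", "age": 34},
    {"zip": "20100", "age": 34},
    {"zip": "20121", "age": 51},  # unique combination: singles someone out
]
```

Note that passing this check says nothing about the inference test: a shared zip/age combination can still leak a sensitive attribute if, say, every record in the group carries the same diagnosis.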
Practical checklist for AI tooling
Before processing
- Identify the legal basis for processing the original PII (consent, legitimate interest, contract)
- If sub-processing via an LLM API: sign a DPA with the provider (OpenAI, Anthropic, etc.)
- Document your sub-processors in your privacy policy (OpenAI, your PII redaction tool, cloud hosting)
- If using a hosted redaction API: verify it does not store input text (check privacy policy)
During redaction
- Use a context-aware detector — regex misses names in natural prose ("call Sarah back")
- Cover all GDPR personal data categories: names, emails, phones, addresses, IDs, dates of birth, IP addresses, biometric/health data in text
- Decide: are you pseudonymizing (keep mapping) or anonymizing (discard mapping)?
- Log only metadata (char count, entity count, timestamp) — never log the original or redacted text
- Apply redaction as close to the data source as possible — ideally before any network hop
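The metadata-only logging rule above is worth making concrete, because it is the one teams most often violate by accident (e.g. via a default request logger). A hedged sketch, with a hypothetical `log_redaction_event` helper:

```python
import logging
import time

logger = logging.getLogger("redaction")

def log_redaction_event(original: str, redacted: str, entity_count: int) -> dict:
    """Build a log record containing ONLY metadata, never the text itself.
    Logging the original (or even the redacted) text would create a new
    copy of personal data that your retention policy must then cover."""
    record = {
        "timestamp": time.time(),
        "input_chars": len(original),
        "output_chars": len(redacted),
        "entities_redacted": entity_count,
    }
    logger.info("redaction event: %s", record)
    return record
```

Audit any middleware (web framework access logs, APM tracing, crash reporters) for the same property: none of it should capture request bodies on the redaction path.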
After processing
- If pseudonymized: store the key→original mapping with the same security as the original PII
- Delete the mapping when no longer needed (honor deletion requests)
- Do not use the LLM's output as a basis for decisions about identified individuals without re-review
- Keep a DPIA (Data Protection Impact Assessment) if processing at scale or for sensitive categories
Special categories (Article 9)
Health data, political opinions, religious beliefs, sexual orientation, and racial/ethnic origin require explicit consent or specific derogations. If your text may contain these categories (e.g., medical records, HR communications), your redaction must be more thorough:
Redacting a name but leaving "patient's HIV diagnosis" in a support ticket does not anonymize the individual. Inference risk must be assessed holistically, not just by counting PII fields removed.
OpenAI Privacy Filter (and PrivacyFilter's API) detects names, contacts, IDs, and financial data — but does not automatically classify inferred sensitive attributes. For Article 9 content, consider adding domain-specific keyword suppression on top of entity detection.
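One way to layer that suppression is a second regex pass over the already-redacted text. The term list below is purely illustrative for a health-sector deployment; build yours from your own corpus and review it with your DPO.

```python
import re

# Hypothetical Article 9 terms for a health-sector deployment.
# Illustrative only; a real list must come from your own domain review.
ARTICLE_9_TERMS = [
    r"HIV", r"diagnos\w+", r"chemotherapy", r"pregnan\w+",
]
_pattern = re.compile(r"\b(" + "|".join(ARTICLE_9_TERMS) + r")\b", re.IGNORECASE)

def suppress_sensitive_terms(redacted_text: str) -> str:
    """Second pass after entity-level redaction: blank out sensitive-category
    keywords so the remaining text cannot be used to infer Article 9
    attributes (the EDPB inference test)."""
    return _pattern.sub("[SENSITIVE]", redacted_text)
```

Keyword lists are blunt instruments (they miss paraphrases like "tested positive"), so treat this as a floor for Article 9 content, not a complete solution.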
Retention and the right to erasure
If you use a hosted API that does not retain text (like PrivacyFilter — see our privacy policy), the input text effectively disappears after the API call. You still need to manage:
- The output text you store in your own database
- The key→original mapping if you pseudonymized
- LLM provider logs (check your provider's data retention settings)
CCPA parallel
California's CCPA and CPRA use the term "deidentified" rather than anonymized, but the principle is similar: data is deidentified if there is no "reasonable basis to believe that the information can be used to identify an individual." The technical threshold is comparable to GDPR's anonymization bar. If you're GDPR-compliant on anonymization, you're generally CCPA-compliant on deidentification.
Start redacting PII today — paste any text into PrivacyFilter and see detected entities in seconds.