How to anonymize text before sending it to ChatGPT
Every time you paste a support ticket, a user email, or a document into ChatGPT, you are potentially sending names, phone numbers, email addresses, and other PII to OpenAI's servers. For personal use, that might be fine. For any business workflow, it almost certainly isn't.
This guide covers the practical workflow for stripping PII from text before it touches any LLM API — with code examples you can drop into a Python pipeline today.
Why this matters legally
Under GDPR (EU) and CCPA (California), sending personal data to a third-party processor requires a legal basis and, in many cases, a Data Processing Agreement (DPA). OpenAI offers a DPA for API customers, but it only covers data sent through the API — not data pasted into ChatGPT's web UI by employees using personal accounts.
The common violation pattern: An employee copies a customer complaint (containing name, email, and order details) into ChatGPT to get a draft reply. No DPA covers this transfer. If that customer is an EU resident, you have a potential GDPR breach.
The cleanest fix: strip PII before the text ever leaves your infrastructure.
The anonymization workflow
The pattern is three steps:
- Detect PII entities and their character offsets
- Replace each entity with a placeholder ([PERSON_1], [EMAIL_2], etc.)
- Send the scrubbed text to the LLM; optionally re-insert names in the LLM's output
Step 3 is optional but powerful: if you keep a mapping of [PERSON_1] → "Alice", you can replace placeholders back into the LLM response, giving the user a personalized answer without ever exposing PII to the model.
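The placeholder-and-mapping idea behind steps 2 and 3 can be sketched locally with a toy regex scrubber. This is a minimal illustration only — it catches email addresses and nothing else, and the function name and regex are illustrative, not part of any API:

```python
import re

def scrub_emails(text: str) -> tuple[str, dict[str, str]]:
    """Replace each email address with a numbered placeholder."""
    mapping: dict[str, str] = {}

    def repl(m: re.Match) -> str:
        placeholder = f"[EMAIL_{len(mapping) + 1}]"
        mapping[placeholder] = m.group(0)
        return placeholder

    clean = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", repl, text)
    return clean, mapping

clean, mapping = scrub_emails("Contact maria@example.com today.")
# clean   -> "Contact [EMAIL_1] today."
# mapping -> {"[EMAIL_1]": "maria@example.com"}

# Step 3: swap the originals back into whatever the LLM returned
restored = clean
for placeholder, original in mapping.items():
    restored = restored.replace(placeholder, original)
```

A real detector (like the API below) also covers names, phone numbers, and other entity types that no single regex can.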
Implementation with PrivacyFilter API
```python
import httpx
import openai

LICENSE_KEY = "your-uuid-here"  # get at privacyfilter.run

def scrub_and_prompt(raw_text: str, user_question: str) -> str:
    # 1. Redact PII
    resp = httpx.post(
        "https://privacyfilter.run/api/redact",
        json={"text": raw_text, "license_key": LICENSE_KEY, "mode": "replace"},
        timeout=15,
    )
    resp.raise_for_status()
    r = resp.json()
    clean_text = r["redacted_text"]
    entity_map = {e["replacement"]: e["original"] for e in r["entities"]}

    # 2. Send to ChatGPT with the scrubbed text
    client = openai.OpenAI()
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer the user's question based on the provided text."},
            {"role": "user", "content": f"Text: {clean_text}\n\nQuestion: {user_question}"},
        ],
    )
    answer = completion.choices[0].message.content

    # 3. Re-insert original values (optional)
    for placeholder, original in entity_map.items():
        answer = answer.replace(placeholder, original)
    return answer

# Example
ticket = "Hi, I'm Maria Rossi (maria@example.com). My order #4521 hasn't arrived."
reply = scrub_and_prompt(ticket, "Write a polite reply acknowledging the delay.")
print(reply)
```
The LLM sees: "Hi, I'm [PERSON_1] ([EMAIL_2]). My order #4521 hasn't arrived." — no PII exposed. The final reply re-inserts "Maria Rossi" and "maria@example.com" automatically.
Free-tier shortcut (no API key)
If you just need a quick scrub, paste your text into privacyfilter.run — no account needed, 3 free redactions per day. Copy the redacted output, paste into ChatGPT, done.
Handling PII in fine-tuning datasets
Fine-tuning a model on customer conversations? The same pattern applies at batch scale. Use the /api/redact/batch endpoint (paid plans) to process up to 20 documents per call:
```python
import httpx

documents = [
    {"id": "t1", "text": "Customer Alice Walker called about invoice 9912..."},
    {"id": "t2", "text": "Support case from bob@widgets.io regarding..."},
]

resp = httpx.post(
    "https://privacyfilter.run/api/redact/batch",
    json={"documents": documents, "license_key": LICENSE_KEY, "mode": "replace"},
    timeout=60,
)
resp.raise_for_status()
r = resp.json()

for item in r["results"]:
    print(item["id"], "→", item["redacted_text"][:80])
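From there, the redacted results can be written out as a fine-tuning JSONL file. A sketch, with assumptions labeled: the `results` list below stands in for the `r["results"]` from the batch call, and the file name and chat-message shape are illustrative (adapt them to your fine-tuning format):

```python
import json

# Stand-in for r["results"] from the batch redaction call
results = [
    {"id": "t1", "redacted_text": "Customer [PERSON_1] called about invoice 9912..."},
    {"id": "t2", "redacted_text": "Support case from [EMAIL_1] regarding..."},
]

# One training example per line, in OpenAI-style chat format
with open("train.jsonl", "w") as f:
    for item in results:
        example = {
            "messages": [
                {"role": "user", "content": item["redacted_text"]},
                {"role": "assistant", "content": "..."},  # your target reply here
            ]
        }
        f.write(json.dumps(example) + "\n")
```

The key property: every example that reaches the fine-tuning job contains placeholders, never raw names or emails.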
What about images and PDFs?
PrivacyFilter operates on plain text. For PDFs, extract the text with pdfplumber or pypdf first, redact it, then re-generate the document. For images containing text (screenshots, scanned forms), add an OCR step (Tesseract, AWS Textract) before calling the redaction API.
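A sketch of the PDF extraction step, assuming pypdf is installed (`pip install pypdf`); the function name is illustrative:

```python
def extract_text_for_redaction(path: str) -> str:
    """Extract plain text from a PDF so it can be sent to /api/redact.

    For scanned/image-only PDFs, swap this for an OCR step
    (Tesseract, AWS Textract) instead.
    """
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(path)
    # extract_text() can return None for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Pass the returned string through the same `/api/redact` call shown earlier before it reaches any LLM.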
Checklist before deploying to production
- Add the redaction step as middleware in your LLM wrapper — don't rely on developers remembering it
- Log entity counts (not the text itself) for compliance auditing
- Use mode=replace (not mask) if you need to re-insert values downstream
- Review your OpenAI DPA — it covers API usage but not web UI usage by employees
- Add PrivacyFilter as a sub-processor in your privacy policy if you use the hosted API
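The first two checklist items can be combined into one thin middleware function. A sketch under stated assumptions: `redact_fn` stands in for whatever calls the redaction API and is assumed to return `(clean_text, entities)` with a "type" key per entity, mirroring the response shape used above:

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pii-middleware")

def redact_middleware(text: str, redact_fn) -> str:
    """Run every outgoing prompt through redact_fn; log counts, never the text."""
    clean_text, entities = redact_fn(text)
    counts = Counter(e["type"] for e in entities)
    log.info("redacted %d entities: %s", len(entities), dict(counts))
    return clean_text

# Usage with a stub redactor (stand-in for the real API call):
stub = lambda t: (t.replace("Maria", "[PERSON_1]"), [{"type": "PERSON"}])
clean = redact_middleware("Hi, I'm Maria.", stub)
# clean -> "Hi, I'm [PERSON_1]."
```

Wiring this into your LLM client wrapper means no individual developer has to remember the redaction step.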
Try PrivacyFilter free — paste any text and get a clean, PII-free version in under 2 seconds.