How to redact customer support logs before LLM fine-tuning
Fine-tuning a support bot on real customer conversations is one of the highest-ROI applications of LLMs for support teams. But raw support transcripts are a PII minefield: names, emails, order numbers, addresses, and sometimes payment details are scattered throughout every ticket.
Sending that data to an LLM provider's fine-tuning endpoint without scrubbing it first is a GDPR/CCPA risk — and may violate the provider's terms of service. This guide shows a production-ready pipeline for redacting PII from a support transcript dataset before it ever leaves your infrastructure.
What counts as PII in support transcripts
- Customer names (first name, last name, or "Hi [name]" patterns)
- Email addresses and phone numbers in the conversation body
- Physical shipping or billing addresses
- Order IDs and account numbers (may be PII under CCPA)
- Any customer-shared payment info (should never be in tickets, but sometimes is)
- Agent names (internal PII — anonymize too)
Note: order IDs are a grey area. Under GDPR they are personal data if linkable to an individual. For a fine-tuning dataset you're sharing externally, redact them to be safe.
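As a rough illustration of two of the categories above, the sketch below pre-scans a message for email-like strings and order-number-like strings. The patterns and the `prescan` helper are hypothetical and deliberately narrow; they are not part of PrivacyFilter and are no substitute for a real redaction pass.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
ORDER_RE = re.compile(r"\border\s+#?(\d{4,})", re.IGNORECASE)


def prescan(text: str) -> dict[str, list[str]]:
    """Return email-like and order-ID-like strings found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "order_ids": ORDER_RE.findall(text),
    }


sample = "Hi, I'm Alice Brown (alice@example.com). My order 44221 hasn't arrived."
hits = prescan(sample)
print(hits)
```

Note that the name "Alice Brown" is not caught here: free-text names are exactly the category where simple patterns fall short.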
Pipeline overview
- Export: Pull transcripts from your helpdesk (Zendesk, Intercom, Freshdesk) as JSONL
- Split: Separate each message turn; a long ticket sent as one document may exceed the 10k-character per-document limit
- Redact: Call the PrivacyFilter batch API once per batch of message turns
- Reassemble: Rebuild the conversation structure with redacted turns
- Format: Convert to fine-tuning JSONL (OpenAI or Anthropic format)
Step 1 — Load tickets from JSONL
import json

def load_tickets(path: str) -> list[dict]:
    """Read one JSON ticket per line, skipping blank lines."""
    tickets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                tickets.append(json.loads(line))
    return tickets

# Expected shape per ticket:
# {
#   "id": "123",
#   "messages": [
#     {"role": "customer", "content": "Hi, I'm Alice Brown. My order 44221 hasn't arrived."},
#     {"role": "agent", "content": "Hi Alice! Let me look that up for you..."}
#   ]
# }
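Helpdesk exports are rarely uniform, so it can help to check each ticket against the expected shape before redacting. The following is a minimal sketch; `validate_ticket` is a hypothetical helper, and it assumes the only roles are "customer" and "agent", as in the example above.

```python
def validate_ticket(ticket: dict) -> list[str]:
    """Return a list of problems; an empty list means the ticket looks usable."""
    problems = []
    if not isinstance(ticket.get("id"), str):
        problems.append("missing or non-string 'id'")
    messages = ticket.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty list")
        return problems
    for i, m in enumerate(messages):
        if not isinstance(m, dict):
            problems.append(f"message {i} is not an object")
            continue
        # Assumed role vocabulary, taken from the example ticket shape.
        if m.get("role") not in {"customer", "agent"}:
            problems.append(f"message {i} has unexpected role {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems
```

A typical use would be to keep only tickets for which `validate_ticket(t)` returns an empty list and to log the rest for manual review.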
Step 2 — Batch redact via PrivacyFilter API
import asyncio

import httpx

LICENSE_KEY = "your-uuid-here"
MAX_CHARS = 9_800  # leave headroom below 10k

def truncate(text: str) -> str:
    # Anything past MAX_CHARS is dropped, not redacted.
    return text[:MAX_CHARS] if len(text) > MAX_CHARS else text

async def redact_batch(client: httpx.AsyncClient, messages: list[dict]) -> list[dict]:
    """Redact the 'content' field of each message dict."""
    docs = [{"id": str(i), "text": truncate(m["content"])} for i, m in enumerate(messages)]
    r = await client.post(
        "https://privacyfilter.run/api/redact/batch",
        json={"documents": docs, "license_key": LICENSE_KEY, "mode": "replace"},
    )
    r.raise_for_status()
    # Map results back by id rather than relying on response order.
    results = {item["id"]: item["redacted_text"] for item in r.json()["results"]}
    return [
        {**m, "content": results[str(i)]}
        for i, m in enumerate(messages)
    ]

async def redact_all_tickets(tickets: list[dict]) -> list[dict]:
    async with httpx.AsyncClient(timeout=60) as client:
        # One batch call per ticket, all issued concurrently.
        tasks = [redact_batch(client, t["messages"]) for t in tickets]
        redacted_messages = await asyncio.gather(*tasks)
    return [
        {**ticket, "messages": msgs}
        for ticket, msgs in zip(tickets, redacted_messages)
    ]

tickets = load_tickets("support_export.jsonl")
clean_tickets = asyncio.run(redact_all_tickets(tickets))
print(f"Redacted {len(clean_tickets)} tickets")
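The code above fires one request per ticket all at once and sends every message of a ticket in a single call. For larger exports, a variant that caps concurrency and keeps each call at or below 20 messages (the per-call figure quoted in the cost section) may be safer. This is a sketch under assumptions: `MAX_IN_FLIGHT` is an arbitrary value, and `redact_fn` stands in for any coroutine that takes a list of messages and returns the redacted list.

```python
import asyncio
from typing import Awaitable, Callable

BATCH_SIZE = 20     # per-call message limit quoted in the cost section
MAX_IN_FLIGHT = 5   # assumed value; tune to your own rate limits


def chunked(items: list, size: int) -> list[list]:
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def redact_ticket_bounded(
    messages: list[dict],
    redact_fn: Callable[[list[dict]], Awaitable[list[dict]]],
    sem: asyncio.Semaphore,
) -> list[dict]:
    """Redact one ticket in batches, holding the semaphore for each call."""
    out: list[dict] = []
    for batch in chunked(messages, BATCH_SIZE):
        async with sem:
            out.extend(await redact_fn(batch))
    return out


async def redact_all_bounded(tickets: list[dict], redact_fn) -> list[dict]:
    """Redact all tickets with at most MAX_IN_FLIGHT calls running at once."""
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    results = await asyncio.gather(
        *(redact_ticket_bounded(t["messages"], redact_fn, sem) for t in tickets)
    )
    return [{**t, "messages": msgs} for t, msgs in zip(tickets, results)]
```

With the Step 2 code, `redact_fn` could be `lambda batch: redact_batch(client, batch)`, created inside the `async with httpx.AsyncClient(...)` block.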
Step 3 — Format for fine-tuning (OpenAI JSONL)
def to_finetune_jsonl(tickets: list[dict], output_path: str):
    with open(output_path, "w") as f:
        for ticket in tickets:
            messages = []
            for m in ticket["messages"]:
                role = "user" if m["role"] == "customer" else "assistant"
                messages.append({"role": role, "content": m["content"]})
            # Add system message
            ft_example = {
                "messages": [
                    {"role": "system", "content": "You are a helpful customer support agent."},
                    *messages,
                ]
            }
            f.write(json.dumps(ft_example) + "\n")

to_finetune_jsonl(clean_tickets, "support_clean_finetune.jsonl")
print("Fine-tune dataset written to support_clean_finetune.jsonl")
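Tickets where the customer never got a reply, or where a turn came back empty after redaction, make poor training examples and may be rejected by the fine-tuning endpoint. A small filter such as the one below can drop them before upload. `is_trainable` is a hypothetical helper, and the rule it applies (at least one user turn, at least one assistant turn, no empty content) is an assumption about what a usable chat example looks like rather than a complete format check.

```python
def is_trainable(example: dict) -> bool:
    """True if the example has a user turn, an assistant turn, and no empty content."""
    msgs = example.get("messages", [])
    roles = [m.get("role") for m in msgs]
    has_content = all(
        isinstance(m.get("content"), str) and bool(m["content"].strip())
        for m in msgs
    )
    return "user" in roles and "assistant" in roles and has_content
```

One way to apply it is to build each `ft_example` as in Step 3 and write the line only when `is_trainable(ft_example)` is true.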
Handling very long tickets
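Some support conversations span dozens of messages and many thousands of characters. The Step 2 code already sends each message as its own document, so a long ticket only becomes a problem when a single message exceeds the per-document limit; in that case truncate() silently drops everything past MAX_CHARS. To keep that text, split the oversized message into chunks, redact each chunk, and rejoin them in order. Message-level granularity also makes it possible to retry one failed message without redoing the rest of the ticket.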
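A minimal sketch of chunking a single oversized message follows. `split_text` is a hypothetical helper; it prefers to cut at a space so that words stay intact, and falls back to a hard cut when no space is available. `MAX_CHARS` reuses the headroom value from Step 2.

```python
MAX_CHARS = 9_800  # same headroom value used in Step 2


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` chars, preferring breaks at spaces."""
    pieces = []
    while len(text) > limit:
        # Last space at or before the limit; the space stays with the next piece.
        cut = text.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut
        pieces.append(text[:cut])
        text = text[cut:]
    pieces.append(text)
    return pieces
```

Joining the pieces with `"".join(...)` reproduces the original text, so the redacted chunks can be rejoined the same way. One caveat: a name or address that straddles a cut may be harder for the redactor to recognize, so chunked messages deserve extra attention during verification.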
Cost estimation for this workflow
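Each batch call handles up to 20 messages. The call counts below assume messages are packed 20 per call across tickets; sending one call per ticket, as the Step 2 code does, means one call per ticket instead. If your average ticket has 8 messages:
- 100 tickets → 800 messages → ~40 batch calls when packed (100 calls at one per ticket) → well within Unlimited Monthly ($19/mo)
- 500 tickets/month → 4,000 messages → ~200 batch calls when packed → still within the 200/day soft cap even if run in a single day (at one call per ticket, spread the 500 calls over three or more days)
- If volume is low: Redact Pack at $9 one-time handles 250 individual messages, which is about 31 tickets at 8 messages each, so 50 tickets would need more than one pack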
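The arithmetic behind these estimates can be checked with a few lines. This sketch assumes the 20-messages-per-call figure above and distinguishes packing messages across tickets from sending one call per ticket; `batch_calls` is a hypothetical helper, and the pricing tiers themselves are not modeled.

```python
import math

BATCH_SIZE = 20  # messages per batch call, per the figure above


def batch_calls(tickets: int, avg_messages: int, pack_across_tickets: bool) -> int:
    """Estimate the number of batch calls for a given ticket volume."""
    if pack_across_tickets:
        # Fill every call to BATCH_SIZE regardless of ticket boundaries.
        return math.ceil(tickets * avg_messages / BATCH_SIZE)
    # One or more calls per ticket, never mixing tickets in a call.
    return tickets * math.ceil(avg_messages / BATCH_SIZE)


print(batch_calls(100, 8, pack_across_tickets=True))   # 40
print(batch_calls(100, 8, pack_across_tickets=False))  # 100
print(batch_calls(500, 8, pack_across_tickets=True))   # 200
```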
Verification step
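Before uploading to fine-tuning, spot-check 20 random tickets: search the output JSONL for known customer emails and names from your export. Zero hits is a good sign but not proof, since it only covers the identifiers you searched for in the tickets you sampled; any hit means the dataset should not be uploaded until the cause is found. The check takes a couple of minutes and can prevent a data incident.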
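The spot-check can be scripted along these lines. `find_leaks` is a hypothetical helper that samples lines from the fine-tuning JSONL and does a case-insensitive substring search for identifiers you already know from the raw export; the sample size of 20 follows the text above, and the fixed seed is only there to make runs repeatable.

```python
import json
import random


def find_leaks(
    jsonl_lines: list[str],
    identifiers: list[str],
    sample_size: int = 20,
    seed: int = 0,
) -> list[tuple[int, str]]:
    """Return (line_index, identifier) pairs found in a random sample of examples."""
    rng = random.Random(seed)
    count = min(sample_size, len(jsonl_lines))
    indices = sorted(rng.sample(range(len(jsonl_lines)), count))
    leaks = []
    for i in indices:
        # Decode the JSON so escaped characters compare as plain text.
        example = json.loads(jsonl_lines[i])
        text = " ".join(m["content"] for m in example["messages"]).lower()
        for ident in identifiers:
            if ident.lower() in text:
                leaks.append((i, ident))
    return leaks
```

To run it against the Step 3 output, read `support_clean_finetune.jsonl` into a list of lines and pass a list of emails and full names taken from the original export. An empty result means none of those identifiers appeared in the sampled examples.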
Start the pipeline today — $19/month for unlimited redactions, batch API included.