April 28, 2026  ·  7 min read

How to redact customer support logs before LLM fine-tuning

Fine-tuning a support bot on real customer conversations is one of the highest-ROI applications of LLMs for support teams. But raw support transcripts are a PII minefield: names, emails, order numbers, addresses, and sometimes payment details are scattered throughout every ticket.

Sending that data to an LLM provider's fine-tuning endpoint without scrubbing it first is a GDPR/CCPA risk — and may violate the provider's terms of service. This guide shows a production-ready pipeline for redacting PII from a support transcript dataset before it ever leaves your infrastructure.

What counts as PII in support transcripts

Note: order IDs are a grey area. Under GDPR they are personal data if linkable to an individual. For a fine-tuning dataset you're sharing externally, redact them to be safe.
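A local pre-pass is one way to catch order references before the text goes anywhere. The sketch below is only an illustration: the pattern (the word "order", an optional "number", "no." or "id", an optional "#", then 4 to 10 digits) and the `[ORDER_ID]` placeholder are assumptions, and `redact_order_ids` is a hypothetical helper, not part of any API. Adjust the pattern to your own order ID scheme.

```python
import re

# Assumed format: "order", optional "number"/"no."/"id", optional "#", then 4-10 digits.
ORDER_ID_RE = re.compile(
    r"\b(order(?:\s+(?:number|no\.?|id))?\s*#?\s*)(\d{4,10})\b",
    re.IGNORECASE,
)

def redact_order_ids(text: str, placeholder: str = "[ORDER_ID]") -> str:
    """Replace the numeric part of an order reference, keeping the words before it."""
    return ORDER_ID_RE.sub(lambda m: m.group(1) + placeholder, text)
```

Keeping the words "order" and "#" in place leaves the sentence readable for training, while the identifier itself is gone.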

Pipeline overview

  1. Export: Pull transcripts from your helpdesk (Zendesk, Intercom, Freshdesk) as JSONL
  2. Split: Separate each message turn; long tickets may exceed the 10k-char limit
  3. Redact: Call PrivacyFilter batch API per chunk
  4. Reassemble: Rebuild the conversation structure with redacted turns
  5. Format: Convert to fine-tuning JSONL (OpenAI or Anthropic format)

Step 1 — Load tickets from JSONL

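import json

def load_tickets(path: str) -> list[dict]:
    """Read one JSON ticket per line, skipping blank lines."""
    tickets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines, e.g. a trailing newline
                tickets.append(json.loads(line))
    return tickets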

# Expected shape per ticket:
# {
#   "id": "123",
#   "messages": [
#     {"role": "customer", "content": "Hi, I'm Alice Brown. My order 44221 hasn't arrived."},
#     {"role": "agent",    "content": "Hi Alice! Let me look that up for you..."}
#   ]
# }

Step 2 — Batch redact via PrivacyFilter API

import httpx
import asyncio

LICENSE_KEY = "your-uuid-here"
MAX_CHARS = 9_800  # leave headroom below 10k

def truncate(text: str) -> str:
    # Text past MAX_CHARS is dropped from the dataset entirely, not redacted.
    return text[:MAX_CHARS]

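BATCH_SIZE = 20  # maximum documents per batch call

async def redact_batch(client: httpx.AsyncClient, messages: list[dict]) -> list[dict]:
    """Redact the 'content' field of each message dict."""
    results: dict[str, str] = {}
    # Send the ticket in slices so no request exceeds BATCH_SIZE documents.
    for start in range(0, len(messages), BATCH_SIZE):
        stop = min(start + BATCH_SIZE, len(messages))
        docs = [{"id": str(i), "text": truncate(messages[i]["content"])} for i in range(start, stop)]
        r = await client.post(
            "https://privacyfilter.run/api/redact/batch",
            json={"documents": docs, "license_key": LICENSE_KEY, "mode": "replace"},
        )
        r.raise_for_status()
        for item in r.json()["results"]:
            results[item["id"]] = item["redacted_text"]
    # Document ids are list positions, so results map straight back by index.
    return [
        {**m, "content": results[str(i)]}
        for i, m in enumerate(messages)
    ]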

async def redact_all_tickets(tickets: list[dict]) -> list[dict]:
    async with httpx.AsyncClient(timeout=60) as client:
        tasks = [redact_batch(client, t["messages"]) for t in tickets]
        redacted_messages = await asyncio.gather(*tasks)
    return [
        {**ticket, "messages": msgs}
        for ticket, msgs in zip(tickets, redacted_messages)
    ]

tickets = load_tickets("support_export.jsonl")
clean_tickets = asyncio.run(redact_all_tickets(tickets))
print(f"Redacted {len(clean_tickets)} tickets")
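The `asyncio.gather` call above starts one request per ticket at the same time, which may be more than the API or your network tolerates on a large export. A minimal sketch of a bounded version follows; `gather_bounded` is a hypothetical helper and the default limit of 8 is an arbitrary assumption, not a documented rate limit.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

async def gather_bounded(
    items: list[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 8,  # assumed value; tune to what the API tolerates
) -> list[R]:
    """Run worker(item) for every item with at most `limit` running at once.

    Results are returned in the same order as `items`.
    """
    sem = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        async with sem:  # wait here while `limit` workers are already active
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

Inside `redact_all_tickets`, the gather line could then become `await gather_bounded(tickets, lambda t: redact_batch(client, t["messages"]))`, which keeps the result order aligned with `tickets` for the `zip` that follows.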

Step 3 — Format for fine-tuning (OpenAI JSONL)

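def to_finetune_jsonl(tickets: list[dict], output_path: str) -> None:
    """Write one chat-format training example per ticket."""
    with open(output_path, "w", encoding="utf-8") as f:
        for ticket in tickets:
            messages = []
            for m in ticket["messages"]:
                # Customer turns become "user"; every other role is treated as "assistant".
                role = "user" if m["role"] == "customer" else "assistant"
                messages.append({"role": role, "content": m["content"]})
            # Prepend a fixed system prompt to every example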
            ft_example = {
                "messages": [
                    {"role": "system", "content": "You are a helpful customer support agent."},
                    *messages,
                ]
            }
            f.write(json.dumps(ft_example) + "\n")

to_finetune_jsonl(clean_tickets, "support_clean_finetune.jsonl")
print("Fine-tune dataset written to support_clean_finetune.jsonl")
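Before uploading, it can help to run a few structural checks over the output file. The sketch below assumes the chat format written above (a `messages` list of `role`/`content` objects) and that each example should contain at least one assistant turn; `validate_example` is a hypothetical helper, and the provider's own validation remains the authority on what is accepted.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> list[str]:
    """Return the problems found in one JSONL line; an empty list means it passed."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, m in enumerate(messages):
        if not isinstance(m, dict):
            problems.append(f"message {i}: not an object")
            continue
        if m.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {m.get('role')!r}")
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant turn, so nothing for the model to learn")
    return problems
```

Running this over every line of `support_clean_finetune.jsonl` and printing the non-empty results gives a quick list of tickets to drop or repair.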

Handling very long tickets

Some support conversations span dozens of messages and many thousands of characters. For tickets that exceed the batch document limit, split per-message and redact each turn individually, then reassemble. The message-level granularity also means an error in one long message doesn't block the rest of the ticket.
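When a single message is itself longer than the per-document limit, one option is to split it into chunks, redact each chunk as its own document, and join the redacted chunks back together. The sketch below shows only the splitting part; `split_text` is a hypothetical helper, and the 9,800-character default mirrors the `MAX_CHARS` headroom used earlier.

```python
def split_text(text: str, max_chars: int = 9_800) -> list[str]:
    """Split text into chunks of at most max_chars, breaking after whitespace when possible.

    Joining the chunks with "" reproduces the original text exactly.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while len(text) - start > max_chars:
        window_end = start + max_chars
        # Last space or newline inside the window; -1 if there is none.
        cut = max(text.rfind(" ", start, window_end), text.rfind("\n", start, window_end))
        # Break just after that whitespace, or hard-cut when no usable break exists.
        end = cut + 1 if cut > start else window_end
        chunks.append(text[start:end])
        start = end
    chunks.append(text[start:])
    return chunks
```

Breaking at whitespace avoids cutting an email address or number in half, but a multi-word name can still straddle two chunks, so long messages deserve extra attention during verification.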

Cost estimation for this workflow

Each batch call handles up to 20 messages. If your average ticket has 8 messages, one batch call covers a whole ticket, so the number of API calls is roughly the number of tickets.
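The call count can be worked out from the message counts in your export. This is a rough sketch: it assumes one or more batch calls per ticket with the 20-message limit quoted above, and the function names are hypothetical.

```python
import math

BATCH_LIMIT = 20  # messages per batch call, per the figure quoted above

def calls_for_ticket(n_messages: int, batch_limit: int = BATCH_LIMIT) -> int:
    """Number of batch calls needed to redact one ticket."""
    return math.ceil(n_messages / batch_limit) if n_messages > 0 else 0

def total_calls(message_counts: list[int]) -> int:
    """Total batch calls for a dataset, given the message count of each ticket."""
    return sum(calls_for_ticket(n) for n in message_counts)
```

For example, an 8-message ticket needs one call and a 45-message ticket needs three, so `total_calls([len(t["messages"]) for t in tickets])` gives the request volume for a whole export.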

Verification step

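Before uploading to fine-tuning, spot-check 20 random tickets: search the output JSONL for known customer emails and names from your export. Zero hits in the sample is a good sign, though a 20-ticket sample cannot prove the whole dataset is clean; any hit means you should stop and investigate before uploading. The check takes about 2 minutes and can prevent a data incident.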
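The spot-check can be scripted along these lines. This is a minimal sketch: `spot_check` is a hypothetical helper, it assumes the chat-format JSONL written in Step 3, and `known_pii` is a list you build yourself from the raw export (emails, full names).

```python
import json
import random
from typing import Optional

def spot_check(
    output_path: str,
    known_pii: list[str],
    sample_size: int = 20,
    seed: Optional[int] = None,
) -> list[tuple[int, str]]:
    """Return (example_number, leaked_string) for known PII found in a random sample."""
    with open(output_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    rng = random.Random(seed)
    picked = rng.sample(range(len(lines)), min(sample_size, len(lines)))
    needles = [p.lower() for p in known_pii if p]

    hits = []
    for idx in sorted(picked):
        # Decode the JSON so escaped characters compare as real text.
        example = json.loads(lines[idx])
        haystack = " ".join(m["content"] for m in example["messages"]).lower()
        hits.extend((idx + 1, needle) for needle in needles if needle in haystack)
    return hits
```

An empty list from `spot_check("support_clean_finetune.jsonl", known_pii)` means none of the listed strings appeared in the sampled examples; the comparison is case-insensitive, so `Alice@Example.com` and `alice@example.com` count as the same string.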

Start the pipeline today — $19/month for unlimited redactions, batch API included.
