Scrubadub vs PrivacyFilter: which Python PII tool is right for your pipeline?
Scrubadub is a solid offline Python library for structured PII (emails, phones, credit cards) with zero external dependencies — good when data cannot leave your infrastructure. PrivacyFilter uses the OpenAI Privacy Filter API and catches context-dependent PII that regex misses — better for pre-LLM scrubbing and pipelines where accuracy matters more than air-gap compliance. Free tier at privacyfilter.run.
Text is processed via the OpenAI Privacy Filter API and is not stored server-side. See the Privacy page for details.
What each tool does — 30-second summary
Scrubadub is an open-source Python library (github.com/LeapBeyond/scrubadub) that detects and replaces PII in plain text using a combination of regular expressions and optional spaCy NLP models. It runs entirely on your machine — no API calls, no network traffic. You install it with pip install scrubadub, pass a string through scrubadub.clean(text), and get back a scrubbed version where entities are replaced with tokens like {{EMAIL}} or {{NAME}}.
PrivacyFilter is a hosted REST API backed by the OpenAI Privacy Filter API engine. You send a POST request to https://privacyfilter.run/api/redact and receive back a redacted string plus a JSON entity map (entities_found: [{type, original, replacement}]) showing exactly what was replaced. No NLP model to install or maintain — detection is handled server-side. A web UI at privacyfilter.run lets non-developers do one-off redactions without writing any code.
Detection approach: regex + NLP vs OpenAI Privacy Filter API
Scrubadub's detection pipeline is layered:
- Regex detectors handle structured patterns: email addresses, credit card numbers, phone numbers with common country formats, URLs.
- spaCy NER detectors (optional, loaded as plugins) handle names and locations using a trained statistical model.
- Custom detectors can be added by subclassing scrubadub.detectors.Detector and writing your own regex or logic.
This approach is fast and fully offline. The trade-off: regex detectors have hard-coded patterns, so unusual phone formats or non-English name structures can slip through. The spaCy NER models are good but treat each token somewhat independently — they can miss a name mentioned only once in a support ticket without an honorific prefix, or confuse company names with person names in ambiguous contexts.
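As a rough sketch of that plugin point (the RegexDetector base class and filth_cls attribute follow scrubadub's documented custom-detector pattern; the employee-ID format here is invented for illustration):

```python
import re

import scrubadub
from scrubadub.filth import Filth

# Hypothetical organisation-specific identifier, e.g. "EMP-004821".
class EmployeeIdFilth(Filth):
    type = "employee_id"

class EmployeeIdDetector(scrubadub.detectors.RegexDetector):
    name = "employee_id"
    regex = re.compile(r"\bEMP-\d{6}\b")
    filth_cls = EmployeeIdFilth

scrubber = scrubadub.Scrubber()
scrubber.add_detector(EmployeeIdDetector)
print(scrubber.clean("Ticket raised by EMP-004821 yesterday."))
# Roughly: "Ticket raised by {{EMPLOYEE_ID}} yesterday."
```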
PrivacyFilter sends text to the OpenAI Privacy Filter API, which uses a large language model to understand PII in context. The model reads the full passage before classifying entities, so it handles cases like:
- A first name only, referenced by a colleague in a chat log: "Ask Sarah to review it"
- An informal address: "drop it at the place near Central Park, 10th floor"
- A disguised email: "john dot smith at gmail"
These are patterns that fail regex and often fail spaCy NER because they lack the structural markers those systems rely on.
Supported entity types compared
| Entity type | Scrubadub | PrivacyFilter |
|---|---|---|
| NAME | Via spaCy plugin (optional) | ✓ Native |
| EMAIL | ✓ Regex | ✓ Native |
| PHONE | ✓ Regex (limited locales) | ✓ Native |
| ADDRESS | Partial (spaCy GPE/LOC) | ✓ Native |
| SSN | US-only regex plugin | ✓ Native |
| CREDIT_CARD | ✓ Regex | ✓ Native |
| DATE_OF_BIRTH | No first-class support | ✓ Native |
| IP_ADDRESS | Via custom detector | ✓ Native |
| URL | ✓ Regex | ✓ Native |
| Custom regex patterns | ✓ Subclass Detector | Not currently supported |
One practical gap in Scrubadub: it does not expose an entities_found map by default — you get back the scrubbed text but not a structured list of what was replaced and with what token. PrivacyFilter returns entities_found: [{type, original, replacement}] in every response, which makes downstream pseudonymization (re-inserting original values into an LLM response) straightforward.
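With that response shape, re-insertion is a simple replacement pass. A minimal sketch, assuming placeholder strings such as [PERSON_1] never occur naturally in the LLM output:

```python
def reinsert_pii(llm_output: str, entities_found: list[dict]) -> str:
    """Swap each placeholder back to its original value.

    entities_found is the map PrivacyFilter returns, e.g.
    [{"type": "NAME", "original": "Maria Rossi", "replacement": "[PERSON_1]"}].
    """
    for entity in entities_found:
        llm_output = llm_output.replace(entity["replacement"], entity["original"])
    return llm_output
```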
Accuracy in practice — where each tool struggles
Where Scrubadub struggles:
- Informal or non-English names (the spaCy en_core_web_sm model has a high miss rate on non-Anglo names)
- Phone numbers in local or non-standard formats (e.g., Italian mobile without country code)
- Contextual PII — a user saying "my sister lives on Maple Street" will not trigger an address detector if no house number is present
- False positives on company names being tagged as person names
Where PrivacyFilter struggles:
- Highly domain-specific identifiers (e.g., internal employee IDs, proprietary system codes) — the model has no schema for these unless they look like structured patterns
- Very long documents: each credit covers up to 10,000 characters, so larger documents need to be chunked (see the sketch after this list)
- Situations where the model must make a judgment call about borderline cases, e.g. whether "the doctor" in a medical note counts as a PII reference; its judgment may vary between runs
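A minimal sketch of that chunking step, splitting naively at the 10,000-character credit boundary; production code should cut on sentence or paragraph boundaries so an entity is never split in half:

```python
import httpx

CHUNK = 10_000  # one credit covers up to 10,000 characters

def redact_long(text: str, license_key: str) -> str:
    # Redact each chunk separately. Note that placeholder numbering likely
    # restarts per chunk, so [PERSON_1] in one chunk is not guaranteed to
    # refer to the same person as [PERSON_1] in the next.
    pieces = [text[i:i + CHUNK] for i in range(0, len(text), CHUNK)]
    redacted = []
    for piece in pieces:
        response = httpx.post(
            "https://privacyfilter.run/api/redact",
            json={"text": piece, "license_key": license_key, "mode": "replace"},
            timeout=30,
        )
        response.raise_for_status()
        redacted.append(response.json()["redacted_text"])
    return "".join(redacted)
```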
Practical rule of thumb: If your text is structured (CSV rows, form fields, log lines with a known schema), Scrubadub's regex detectors are fast, predictable, and sufficient. If your text is unstructured prose — support tickets, chat logs, free-form notes — the context-aware detection in PrivacyFilter will catch meaningfully more PII.
Integration: Scrubadub pip install vs PrivacyFilter REST API
Scrubadub installation and basic usage:
```bash
pip install scrubadub scrubadub-spacy spacy
python -m spacy download en_core_web_trf  # larger, more accurate model
```

```python
import scrubadub
import scrubadub_spacy

scrubber = scrubadub.Scrubber()
scrubber.add_detector(scrubadub_spacy.detectors.SpacyEntityDetector)

text = "Hi, I'm Maria Rossi, you can reach me at maria@example.com or +39 333 1234567."
clean = scrubber.clean(text)
print(clean)
# "Hi, I'm {{NAME}}, you can reach me at {{EMAIL}} or {{PHONE}}."
```
You get back a scrubbed string. There is no built-in structured entity map — if you need one you have to cross-reference the original and scrubbed strings yourself, or use the lower-level scrubber.iter_filth(text) API to inspect the detected Filth objects before replacement.
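A short sketch of that lower-level path (attribute names as on scrubadub's Filth class; exact detections depend on the loaded model):

```python
# Enumerate the detected Filth objects instead of producing a cleaned string.
for filth in scrubber.iter_filth(text):
    print(filth.type, repr(filth.text), filth.beg, filth.end)
# e.g. name 'Maria Rossi' 8 19
#      email 'maria@example.com' 41 58
```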
PrivacyFilter REST API:
```python
import httpx

LICENSE_KEY = "your-uuid-here"  # get at privacyfilter.run

text = "Hi, I'm Maria Rossi, you can reach me at maria@example.com or +39 333 1234567."

response = httpx.post(
    "https://privacyfilter.run/api/redact",
    json={"text": text, "license_key": LICENSE_KEY, "mode": "replace"},
    timeout=15,
)
response.raise_for_status()
data = response.json()

print(data["redacted_text"])
# "Hi, I'm [PERSON_1], you can reach me at [EMAIL_2] or [PHONE_3]."

for entity in data["entities_found"]:
    print(entity)  # {"type": "NAME", "original": "Maria Rossi", "replacement": "[PERSON_1]"}
```
No model download, no spaCy dependency, no GPU requirement. The trade-off is a network round-trip (~1–2 seconds) and the fact that text passes through external infrastructure (see the GDPR section below).
Using PrivacyFilter from the command line (curl):
```bash
curl -s -X POST https://privacyfilter.run/api/redact \
  -H "Content-Type: application/json" \
  -d '{"text": "Call John at 555-0100.", "license_key": "your-uuid", "mode": "replace"}' \
  | python3 -m json.tool
```
Offline vs hosted: data residency and GDPR implications
Scrubadub processes everything locally. Text never leaves your machine or your infrastructure. This makes it the unambiguous choice when:
- You are processing data subject to strict data residency requirements (e.g., health records under HIPAA, EU financial data that must remain on-premises)
- Your organisation's security policy prohibits sending any text to third-party APIs
- You are operating in an air-gapped environment
PrivacyFilter sends text to the OpenAI Privacy Filter API for processing. PrivacyFilter's own servers do not store input text — only rate-limit metadata is written to SQLite. However, the text does pass through OpenAI's infrastructure, which means OpenAI's Data Processing Agreement (DPA) applies. For most GDPR use cases this is acceptable: OpenAI offers a DPA for API customers and does not use API data to train models by default. But you need to be aware of and document this transfer, especially for sensitive categories of personal data.
GDPR note: PrivacyFilter does not store your input text server-side. Processing occurs via OpenAI Privacy Filter API under OpenAI's DPA. For EU data: the transfer is covered by standard contractual clauses in OpenAI's DPA. See /privacy for the full data flow.
For many teams, the practical question is: are you sending text to an LLM anyway? If you are pre-processing support tickets before passing them to GPT-4o, the data is already going to OpenAI — using PrivacyFilter to strip PII first is strictly privacy-improving, not privacy-reducing.
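A minimal end-to-end sketch of that pipeline, assuming the official openai Python package and the redact response shape shown above:

```python
import httpx
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def scrub(text: str, license_key: str) -> dict:
    response = httpx.post(
        "https://privacyfilter.run/api/redact",
        json={"text": text, "license_key": license_key, "mode": "replace"},
        timeout=15,
    )
    response.raise_for_status()
    return response.json()

ticket = "Customer Maria Rossi (maria@example.com) reports a login failure."
scrubbed = scrub(ticket, "your-uuid-here")

# The model only ever sees placeholders, never the raw PII.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Draft a reply to: {scrubbed['redacted_text']}"}],
)
draft = reply.choices[0].message.content

# Re-insert the original values before the draft goes back to the customer.
for entity in scrubbed["entities_found"]:
    draft = draft.replace(entity["replacement"], entity["original"])
print(draft)
```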
Pricing and cost at scale
| Tier | Scrubadub | PrivacyFilter |
|---|---|---|
| Free | Unlimited (self-hosted) | 3 redactions/day, up to 2,000 chars each |
| Low volume | Infra + maintenance costs | $9 for 50 credits (each = up to 10,000 chars) |
| Unlimited | Infra + maintenance costs | $19/month, unlimited, batch upload included |
| Enterprise | Self-managed (free licence) | Contact — not the primary target |
Running Scrubadub is free in licence terms, but not in total cost. A production deployment needs: a Python environment with spaCy models (the transformer model is ~500 MB), enough CPU/GPU to run inference at your required throughput, and ongoing maintenance when spaCy or scrubadub releases breaking changes. For a team processing thousands of documents per day, this infrastructure cost is easily justified. For a team processing dozens, the PrivacyFilter $19/month plan is almost certainly cheaper than the engineering time to maintain a self-hosted stack.
At the $9 credit pack level: 50 credits × 10,000 chars = 500,000 characters of text, which is roughly 75,000 words or around 300 typical support tickets. For occasional batch jobs, this is very cost-effective.
When to use Scrubadub
- Your data cannot leave your infrastructure (HIPAA, financial services, defence)
- You are processing high volumes of structured text with predictable PII patterns (log files, form submissions)
- You need custom detectors for organisation-specific identifiers (employee IDs, internal codes)
- You are already running a Python NLP stack and adding another dependency is no burden
- Latency is a hard constraint and you cannot afford a network round-trip per document
When to use PrivacyFilter
- You are scrubbing unstructured prose — support tickets, chat logs, free-form notes — where context-aware detection matters
- You need a clean JSON entity map to perform pseudonymization (re-insert original values into LLM output)
- You want to pre-process text before sending it to an LLM like GPT-4o or Claude — PrivacyFilter is purpose-built for this pipeline
- You do not want to manage spaCy model versions, GPU memory, or spaCy compatibility with other packages
- You need a web UI for non-developer team members (HR, legal, compliance) to do ad-hoc redactions
- Your volume is low to medium and the $19/month unlimited plan is cheaper than running your own infra
Try PrivacyFilter free — no account required.
Paste any text and get PII redacted in under 2 seconds, with a full entity map. Free tier: 3 redactions/day up to 2,000 characters. Open the tool →