May 6, 2026  ·  7 min read

Scrubadub vs PrivacyFilter: which Python PII tool is right for your pipeline?

TL;DR

Scrubadub is a solid offline Python library for structured PII (emails, phones, credit cards) with zero external dependencies — good when data cannot leave your infrastructure. PrivacyFilter uses the OpenAI Privacy Filter API and catches context-dependent PII that regex misses — better for pre-LLM scrubbing and pipelines where accuracy matters more than air-gap compliance. Free tier at privacyfilter.run.

Text is processed via OpenAI Privacy Filter API and is not stored server-side. See Privacy for details.

What each tool does — 30-second summary

Scrubadub is an open-source Python library (github.com/LeapBeyond/scrubadub) that detects and replaces PII in plain text using a combination of regular expressions and optional spaCy NLP models. It runs entirely on your machine — no API calls, no network traffic. You install it with pip install scrubadub, pass a string through scrubadub.clean(text), and get back a scrubbed version where entities are replaced with tokens like {{NAME}} or {{EMAIL}}.

PrivacyFilter is a hosted REST API backed by the OpenAI Privacy Filter API engine. You send a POST request to https://privacyfilter.run/api/redact and receive back a redacted string plus a JSON entity map (entities_found: [{type, original, replacement}]) showing exactly what was replaced. No NLP model to install or maintain — detection is handled server-side. A web UI at privacyfilter.run lets non-developers do one-off redactions without writing any code.

Detection approach: regex + NLP vs OpenAI Privacy Filter API

Scrubadub's detection pipeline is layered:

- Regex detectors for structured PII: email addresses, phone numbers, credit card numbers, URLs.
- Optional spaCy NER models (via the scrubadub_spacy plugin) for names, locations, and other entities that have no fixed surface pattern.

This approach is fast and fully offline. The trade-off: regex detectors have hard-coded patterns, so unusual phone formats or non-English name structures can slip through. The spaCy NER models are good but treat each token somewhat independently — they can miss a name mentioned only once in a support ticket without an honorific prefix, or confuse company names with person names in ambiguous contexts.
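To make that trade-off concrete, here is a toy illustration of how a fixed phone pattern behaves. The regex below is a simplified sketch of a North-American-style detector, not scrubadub's actual pattern:

```python
import re

# Simplified North American phone pattern -- illustrative only,
# not the regex scrubadub actually ships.
US_PHONE = re.compile(r"\b\d{3}[-.]\d{4}\b|\(\d{3}\)\s*\d{3}[-.]\d{4}")

print(bool(US_PHONE.search("Call John at 555-0100.")))       # True
print(bool(US_PHONE.search("Reach me at +39 333 1234567")))  # False: format not in the pattern
```

The Italian number is perfectly valid PII, but because its shape was never encoded in the pattern, a regex-only pipeline walks straight past it.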

PrivacyFilter sends text to the OpenAI Privacy Filter API, which uses a large language model to understand PII in context. The model reads the full passage before classifying entities, so it handles cases like:

- A name mentioned only once, lowercase, with no honorific or other structural cue ("maria said she'd call back tomorrow")
- Phone numbers in unusual or international formats
- Free-form postal addresses spread across a sentence
- Distinguishing a company name from a person's name in ambiguous contexts

These are patterns that fail regex and often fail spaCy NER because they lack the structural markers those systems rely on.

Supported entity types compared

Entity type           | Scrubadub                     | PrivacyFilter
NAME                  | Via spaCy plugin (optional)   | ✓ Native
EMAIL                 | ✓ Regex                       | ✓ Native
PHONE                 | ✓ Regex (limited locales)     | ✓ Native
ADDRESS               | Partial (spaCy GPE/LOC)       | ✓ Native
SSN                   | US-only regex plugin          | ✓ Native
CREDIT_CARD           | ✓ Regex                       | ✓ Native
DATE_OF_BIRTH         | No first-class support        | ✓ Native
IP_ADDRESS            | Via custom detector           | ✓ Native
URL                   | ✓ Regex                       | ✓ Native
Custom regex patterns | ✓ Subclass Detector           | Not currently supported

One practical gap in Scrubadub: it does not expose an entities_found map by default — you get back the scrubbed text but not a structured list of what was replaced and with what token. PrivacyFilter returns entities_found: [{type, original, replacement}] in every response, which makes downstream pseudonymization (re-inserting original values into an LLM response) straightforward.
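As a sketch of that pseudonymization round-trip, using the entities_found shape shown above (the helper name is ours, not part of either library):

```python
def restore_pii(llm_output: str, entities_found: list) -> str:
    """Swap each placeholder back for the original value it replaced."""
    for entity in entities_found:
        llm_output = llm_output.replace(entity["replacement"], entity["original"])
    return llm_output

# Entity map as returned in a PrivacyFilter response
entities = [
    {"type": "NAME", "original": "Maria Rossi", "replacement": "[PERSON_1]"},
    {"type": "EMAIL", "original": "maria@example.com", "replacement": "[EMAIL_2]"},
]

print(restore_pii("Dear [PERSON_1], we have emailed [EMAIL_2].", entities))
# Dear Maria Rossi, we have emailed maria@example.com.
```

The LLM only ever sees the placeholders; the real values are re-inserted locally after the response comes back.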

Accuracy in practice — where each tool struggles

Where Scrubadub struggles:

- Unusual phone formats and non-English names, since its regex patterns are hard-coded for a handful of locales
- Names without structural cues: lowercase, no honorific, mentioned only once in a passage
- Disambiguating company names from person names
- Dates of birth and free-form addresses, which have no first-class detector

Where PrivacyFilter struggles:

- Latency: a ~1–2 second network round-trip per request, versus milliseconds for local regex
- Air-gapped or data-residency-restricted environments, since text transits OpenAI infrastructure
- Organisation-specific identifiers: custom regex patterns are not currently supported
- Sustained high volume on the free tier, which is capped at 3 redactions/day

Practical rule of thumb: If your text is structured (CSV rows, form fields, log lines with a known schema), Scrubadub's regex detectors are fast, predictable, and sufficient. If your text is unstructured prose — support tickets, chat logs, free-form notes — the context-aware detection in PrivacyFilter will catch meaningfully more PII.
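That rule of thumb can be encoded as a trivial dispatch. This is a sketch only — the field names and KNOWN_FIELDS schema are hypothetical:

```python
def choose_redactor(record: dict) -> str:
    """Route structured records to local regex scrubbing, free text to the hosted API."""
    # Hypothetical known schema: fields whose values match fixed patterns.
    KNOWN_FIELDS = {"email", "phone", "card_number", "name"}
    if set(record) <= KNOWN_FIELDS:
        return "scrubadub"       # predictable, offline, fast
    return "privacyfilter"      # context-aware detection for prose

print(choose_redactor({"email": "a@b.com", "phone": "555-0100"}))  # scrubadub
print(choose_redactor({"ticket_body": "maria said she'd call"}))   # privacyfilter
```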

Integration: Scrubadub pip install vs PrivacyFilter REST API

Scrubadub installation and basic usage:

pip install scrubadub scrubadub-spacy spacy
python -m spacy download en_core_web_trf  # larger, more accurate model
import scrubadub
import scrubadub_spacy

scrubber = scrubadub.Scrubber()
scrubber.add_detector(scrubadub_spacy.detectors.SpacyEntityDetector)

text = "Hi, I'm Maria Rossi, you can reach me at maria@example.com or +39 333 1234567."
clean = scrubber.clean(text)
print(clean)
# "Hi, I'm {{NAME}}, you can reach me at {{EMAIL}} or {{PHONE}}."

You get back a scrubbed string. There is no built-in structured entity map — if you need one you have to cross-reference the original and scrubbed strings yourself, or use the lower-level scrubber.iter_filth(text) API to inspect detected Filth objects before replacement.

PrivacyFilter REST API:

import httpx

LICENSE_KEY = "your-uuid-here"  # get at privacyfilter.run

text = "Hi, I'm Maria Rossi, you can reach me at maria@example.com or +39 333 1234567."

response = httpx.post(
    "https://privacyfilter.run/api/redact",
    json={"text": text, "license_key": LICENSE_KEY, "mode": "replace"},
    timeout=15,
).raise_for_status().json()

print(response["redacted_text"])
# "Hi, I'm [PERSON_1], you can reach me at [EMAIL_2] or [PHONE_3]."

for entity in response["entities_found"]:
    print(entity)  # {"type": "NAME", "original": "Maria Rossi", "replacement": "[PERSON_1]"}

No model download, no spaCy dependency, no GPU requirement. The trade-off is a network round-trip (~1–2 seconds) and the fact that text passes through external infrastructure (see the GDPR section below).

Using PrivacyFilter from the command line (curl):

curl -s -X POST https://privacyfilter.run/api/redact \
  -H "Content-Type: application/json" \
  -d '{"text": "Call John at 555-0100.", "license_key": "your-uuid", "mode": "replace"}' \
  | python3 -m json.tool

Offline vs hosted: data residency and GDPR implications

Scrubadub processes everything locally. Text never leaves your machine or your infrastructure. This makes it the unambiguous choice when:

- Data is subject to residency or contractual rules that forbid any third-party processing
- You operate in an air-gapped or strictly on-premises environment
- You cannot (or do not want to) document an external DPA covering the transfer

PrivacyFilter sends text to the OpenAI Privacy Filter API for processing. PrivacyFilter's own servers do not store input text — only rate-limit metadata is written to SQLite. However, the text does pass through OpenAI's infrastructure, which means OpenAI's Data Processing Agreement (DPA) applies. For most GDPR use cases this is acceptable: OpenAI offers a DPA for API customers and does not use API data to train models by default. But you need to be aware of and document this transfer, especially for sensitive categories of personal data.

GDPR note: PrivacyFilter does not store your input text server-side. Processing occurs via OpenAI Privacy Filter API under OpenAI's DPA. For EU data: the transfer is covered by standard contractual clauses in OpenAI's DPA. See /privacy for the full data flow.

For many teams, the practical question is: are you sending text to an LLM anyway? If you are pre-processing support tickets before passing them to GPT-4o, the data is already going to OpenAI — using PrivacyFilter to strip PII first is strictly privacy-improving, not privacy-reducing.

Pricing and cost at scale

Tier       | Scrubadub                   | PrivacyFilter
Free       | Unlimited (self-hosted)     | 3 redactions/day, up to 2,000 chars each
Low volume | Infra + maintenance costs   | $9 for 50 credits (each = up to 10,000 chars)
Unlimited  | Infra + maintenance costs   | $19/month, unlimited, batch upload included
Enterprise | Self-managed (free licence) | Contact — not the primary target

Running Scrubadub is free in licence terms, but not in total cost. A production deployment needs: a Python environment with spaCy models (the transformer model is ~500 MB), enough CPU/GPU to run inference at your required throughput, and ongoing maintenance when spaCy or scrubadub releases breaking changes. For a team processing thousands of documents per day, this infrastructure cost is easily justified. For a team processing dozens, the PrivacyFilter $19/month plan is almost certainly cheaper than the engineering time to maintain a self-hosted stack.

At the $9 credit pack level: 50 credits × 10,000 chars = 500,000 characters of text, which is roughly 75,000 words or around 300 typical support tickets. For occasional batch jobs, this is very cost-effective.
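The arithmetic generalizes into a quick back-of-envelope estimator. The pack terms come from the table above; the helper name is ours, and it assumes documents pack neatly into credits (in practice each credit covers one document of up to 10,000 characters):

```python
import math

CHARS_PER_CREDIT = 10_000   # one credit covers a document up to 10,000 chars
PACK_CREDITS = 50           # credits in a $9 pack
PACK_PRICE = 9.0

def pack_cost(total_chars: int) -> float:
    """Rough dollar cost of redacting total_chars using $9 credit packs."""
    credits = math.ceil(total_chars / CHARS_PER_CREDIT)
    packs = math.ceil(credits / PACK_CREDITS)
    return packs * PACK_PRICE

print(pack_cost(500_000))    # 9.0  -- exactly one pack
print(pack_cost(1_200_000))  # 27.0 -- three packs; the $19/month plan wins here
```

Past roughly two packs a month, the unlimited plan is the cheaper option.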

When to use Scrubadub

- Text cannot leave your infrastructure: air-gapped, strict residency, no external DPA possible
- Input is structured with a known schema: CSV rows, form fields, log lines
- You need custom regex detectors for organisation-specific identifiers
- Volume is high enough that per-request API latency or cost would hurt

When to use PrivacyFilter

- You are scrubbing unstructured prose: support tickets, chat logs, free-form notes
- You are stripping PII before sending text to an LLM, where the data is headed to OpenAI anyway
- You need the structured entities_found map for downstream pseudonymization
- You would rather pay $19/month than maintain a self-hosted spaCy/NLP stack

FAQ

Is Scrubadub still actively maintained?
Scrubadub receives periodic updates on GitHub, but it is not under heavy active development as of 2026. The core library is stable for its supported entity types, but new PII categories or model-based detection are unlikely to be added quickly. For active feature development, consider Presidio or a hosted API.
Does Scrubadub support SSNs, credit card numbers, and dates of birth?
Scrubadub supports credit card numbers and email addresses out of the box via regex. SSN detection is available as a locale-specific detector. Dates of birth are not reliably extracted as a first-class entity type — you would need to add a custom detector.
Can PrivacyFilter be used offline or self-hosted?
No. PrivacyFilter is a hosted API backed by OpenAI Privacy Filter. Text is processed via the API and is not stored on PrivacyFilter's servers, but it does pass through OpenAI infrastructure. If you require fully offline or on-premises processing, Scrubadub or Microsoft Presidio are better fits.
Which tool is better for redacting PII before sending text to an LLM?
PrivacyFilter is purpose-built for this use case. Its context-aware detection (powered by OpenAI Privacy Filter API) handles ambiguous entities that regex-based tools miss — for example, a name mentioned only once in a support ticket without a standard prefix. The JSON entity map it returns also makes it easy to re-insert values into the LLM's output.
What does PrivacyFilter cost compared to running Scrubadub yourself?
Scrubadub is free to run but has infrastructure and maintenance costs. PrivacyFilter offers a free tier (3 redactions/day, up to 2,000 characters per document), a $9 credit pack (50 credits, each covering documents up to 10,000 characters), and an unlimited plan at $19/month. For low-to-medium volume pipelines the hosted cost is often lower than maintaining a self-hosted NLP stack.

Try PrivacyFilter free — no account required.
Paste any text and get PII redacted in under 2 seconds, with a full entity map. Free tier: 3 redactions/day up to 2,000 characters. Open the tool →
