Scrubadub vs PrivacyFilter: which Python PII tool is right for your pipeline?
Scrubadub is a solid offline Python library for structured PII (emails, phones, credit cards) with zero external dependencies — good when data cannot leave your infrastructure. PrivacyFilter uses the OpenAI Privacy Filter API and catches context-dependent PII that regex misses — better for pre-LLM scrubbing and pipelines where accuracy matters more than air-gap compliance. Free tier at privacyfilter.run.
Text is processed via the OpenAI Privacy Filter API and is not stored server-side. See the Privacy page for details.
What each tool does — 30-second summary
Scrubadub is an open-source Python library (github.com/LeapBeyond/scrubadub) that detects and replaces PII in plain text using a combination of regular expressions and optional spaCy NLP models. It runs entirely on your machine — no API calls, no network traffic. You install it with pip install scrubadub, pass a string through scrubadub.clean(text), and get back a scrubbed version where entities are replaced with tokens like {{EMAIL}} or {{NAME}}.
PrivacyFilter is a hosted REST API backed by the OpenAI Privacy Filter API engine. You send a POST request to https://privacyfilter.run/api/redact and receive back a redacted string plus a JSON entity map (entities_found: [{type, original, replacement}]) showing exactly what was replaced. No NLP model to install or maintain — detection is handled server-side. A web UI at privacyfilter.run lets non-developers do one-off redactions without writing any code.
Detection approach: regex + NLP vs OpenAI Privacy Filter API
Scrubadub's detection pipeline is layered:
- Regex detectors handle structured patterns: email addresses, credit card numbers, phone numbers with common country formats, URLs.
- spaCy NER detectors (optional, loaded as plugins) handle names and locations using a trained statistical model.
- Custom detectors can be added by subclassing scrubadub.detectors.Detector and writing your own regex or logic.
This approach is fast and fully offline. The trade-off: regex detectors have hard-coded patterns, so unusual phone formats or non-English name structures can slip through. The spaCy NER models are good but treat each token somewhat independently — they can miss a name mentioned only once in a support ticket without an honorific prefix, or confuse company names with person names in ambiguous contexts.
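As a rough sketch of that plugin point (the RegexDetector base class and filth_cls attribute follow scrubadub's documented custom-detector pattern; the employee-ID format here is invented for illustration):

```python
import re

import scrubadub
from scrubadub.filth import Filth

# Hypothetical organisation-specific identifier, e.g. "EMP-004821".
class EmployeeIdFilth(Filth):
    type = "employee_id"

class EmployeeIdDetector(scrubadub.detectors.RegexDetector):
    name = "employee_id"
    regex = re.compile(r"\bEMP-\d{6}\b")
    filth_cls = EmployeeIdFilth

scrubber = scrubadub.Scrubber()
scrubber.add_detector(EmployeeIdDetector)
print(scrubber.clean("Ticket raised by EMP-004821 yesterday."))
# Roughly: "Ticket raised by {{EMPLOYEE_ID}} yesterday."
```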
PrivacyFilter sends text to the OpenAI Privacy Filter API, which uses a large language model to understand PII in context. The model reads the full passage before classifying entities, so it handles cases like:
- A first name only, referenced by a colleague in a chat log: "Ask Sarah to review it"
- An informal address: "drop it at the place near Central Park, 10th floor"
- A disguised email: "john dot smith at gmail"
These are patterns that fail regex and often fail spaCy NER because they lack the structural markers those systems rely on.
Supported entity types compared
| Entity type | Scrubadub | PrivacyFilter |
|---|---|---|
| NAME | Via spaCy plugin (optional) | ✓ Native |
| EMAIL | ✓ Regex | ✓ Native |
| PHONE | ✓ Regex (limited locales) | ✓ Native |
| ADDRESS | Partial (spaCy GPE/LOC) | ✓ Native |
| SSN | US-only regex plugin | ✓ Native |
| CREDIT_CARD | ✓ Regex | ✓ Native |
| DATE_OF_BIRTH | No first-class support | ✓ Native |
| IP_ADDRESS | Via custom detector | ✓ Native |
| URL | ✓ Regex | ✓ Native |
| Custom regex patterns | ✓ Subclass Detector | Not currently supported |
One practical gap in Scrubadub: it does not expose an entities_found map by default — you get back the scrubbed text but not a structured list of what was replaced and with what token. PrivacyFilter returns entities_found: [{type, original, replacement}] in every response, which makes downstream pseudonymization (re-inserting original values into an LLM response) straightforward.
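With that response shape, re-insertion is a simple replacement pass. A minimal sketch, assuming placeholder strings such as [PERSON_1] never occur naturally in the LLM output:

```python
def reinsert_pii(llm_output: str, entities_found: list[dict]) -> str:
    """Swap each placeholder back to its original value.

    entities_found is the map PrivacyFilter returns, e.g.
    [{"type": "NAME", "original": "Maria Rossi", "replacement": "[PERSON_1]"}].
    """
    for entity in entities_found:
        llm_output = llm_output.replace(entity["replacement"], entity["original"])
    return llm_output
```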
Accuracy in practice — where each tool struggles
Where Scrubadub struggles:
- Informal or non-English names (the spaCy en_core_web_sm model has a high miss rate on non-Anglo names)
- Phone numbers in local or non-standard formats (e.g., Italian mobile without country code)
- Contextual PII — a user saying "my sister lives on Maple Street" will not trigger an address detector if no house number is present
- False positives on company names being tagged as person names
Where PrivacyFilter struggles:
- Highly domain-specific identifiers (e.g., internal employee IDs, proprietary system codes) — the model has no schema for these unless they look like structured patterns
- Very long documents: each credit covers up to 10,000 characters, so larger documents need to be chunked (see the sketch after this list)
- Situations where the model must make a judgment call about borderline cases, e.g. whether "the doctor" in a medical note counts as a PII reference; its judgment may vary between runs
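A minimal sketch of that chunking step, splitting naively at the 10,000-character credit boundary; production code should cut on sentence or paragraph boundaries so an entity is never split in half:

```python
import httpx

CHUNK = 10_000  # one credit covers up to 10,000 characters

def redact_long(text: str, license_key: str) -> str:
    # Redact each chunk separately. Note that placeholder numbering likely
    # restarts per chunk, so [PERSON_1] in one chunk is not guaranteed to
    # refer to the same person as [PERSON_1] in the next.
    pieces = [text[i:i + CHUNK] for i in range(0, len(text), CHUNK)]
    redacted = []
    for piece in pieces:
        response = httpx.post(
            "https://privacyfilter.run/api/redact",
            json={"text": piece, "license_key": license_key, "mode": "replace"},
            timeout=30,
        )
        response.raise_for_status()
        redacted.append(response.json()["redacted_text"])
    return "".join(redacted)
```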
Practical rule of thumb: If your text is structured (CSV rows, form fields, log lines with a known schema), Scrubadub's regex detectors are fast, predictable, and sufficient. If your text is unstructured prose — support tickets, chat logs, free-form notes — the context-aware detection in PrivacyFilter will catch meaningfully more PII.
Integration: Scrubadub pip install vs PrivacyFilter REST API
Scrubadub installation and basic usage:
```bash
pip install scrubadub scrubadub-spacy spacy
python -m spacy download en_core_web_trf  # larger, more accurate model
```

```python
import scrubadub
import scrubadub_spacy

scrubber = scrubadub.Scrubber()
scrubber.add_detector(scrubadub_spacy.detectors.SpacyEntityDetector)

text = "Hi, I'm Maria Rossi, you can reach me at maria@example.com or +39 333 1234567."
clean = scrubber.clean(text)
print(clean)
# "Hi, I'm {{NAME}}, you can reach me at {{EMAIL}} or {{PHONE}}."
```
You get back a scrubbed string. There is no built-in structured entity map — if you need one you have to cross-reference the original and scrubbed strings yourself, or use the lower-level scrubber.iter_filth(text) API to inspect the detected Filth objects before replacement.
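A short sketch of that lower-level path (attribute names as on scrubadub's Filth class; exact detections depend on the loaded model):

```python
# Enumerate the detected Filth objects instead of producing a cleaned string.
for filth in scrubber.iter_filth(text):
    print(filth.type, repr(filth.text), filth.beg, filth.end)
# e.g. name 'Maria Rossi' 8 19
#      email 'maria@example.com' 41 58
```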
PrivacyFilter REST API:
```python
import httpx

LICENSE_KEY = "your-uuid-here"  # get at privacyfilter.run

text = "Hi, I'm Maria Rossi, you can reach me at maria@example.com or +39 333 1234567."

response = httpx.post(
    "https://privacyfilter.run/api/redact",
    json={"text": text, "license_key": LICENSE_KEY, "mode": "replace"},
    timeout=15,
)
response.raise_for_status()
data = response.json()

print(data["redacted_text"])
# "Hi, I'm [PERSON_1], you can reach me at [EMAIL_2] or [PHONE_3]."

for entity in data["entities_found"]:
    print(entity)  # {"type": "NAME", "original": "Maria Rossi", "replacement": "[PERSON_1]"}
```
No model download, no spaCy dependency, no GPU requirement. The trade-off is a network round-trip (~1–2 seconds) and the fact that text passes through external infrastructure (see the GDPR section below).
Using PrivacyFilter from the command line (curl):
```bash
curl -s -X POST https://privacyfilter.run/api/redact \
  -H "Content-Type: application/json" \
  -d '{"text": "Call John at 555-0100.", "license_key": "your-uuid", "mode": "replace"}' \
  | python3 -m json.tool
```
Offline vs hosted: data residency and GDPR implications
Scrubadub processes everything locally. Text never leaves your machine or your infrastructure. This makes it the unambiguous choice when:
- You are processing data subject to strict data residency requirements (e.g., health records under HIPAA, EU financial data that must remain on-premises)
- Your organisation's security policy prohibits sending any text to third-party APIs
- You are operating in an air-gapped environment
PrivacyFilter sends text to the OpenAI Privacy Filter API for processing. PrivacyFilter's own servers do not store input text — only rate-limit metadata is written to SQLite. However, the text does pass through OpenAI's infrastructure, which means OpenAI's Data Processing Agreement (DPA) applies. For most GDPR use cases this is acceptable: OpenAI offers a DPA for API customers and does not use API data to train models by default. But you need to be aware of and document this transfer, especially for sensitive categories of personal data.
GDPR note: PrivacyFilter does not store your input text server-side. Processing occurs via OpenAI Privacy Filter API under OpenAI's DPA. For EU data: the transfer is covered by standard contractual clauses in OpenAI's DPA. See /privacy for the full data flow.
For many teams, the practical question is: are you sending text to an LLM anyway? If you are pre-processing support tickets before passing them to GPT-4o, the data is already going to OpenAI — using PrivacyFilter to strip PII first is strictly privacy-improving, not privacy-reducing.
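A minimal end-to-end sketch of that pipeline, assuming the official openai Python package and the redact response shape shown above:

```python
import httpx
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def scrub(text: str, license_key: str) -> dict:
    response = httpx.post(
        "https://privacyfilter.run/api/redact",
        json={"text": text, "license_key": license_key, "mode": "replace"},
        timeout=15,
    )
    response.raise_for_status()
    return response.json()

ticket = "Customer Maria Rossi (maria@example.com) reports a login failure."
scrubbed = scrub(ticket, "your-uuid-here")

# The model only ever sees placeholders, never the raw PII.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Draft a reply to: {scrubbed['redacted_text']}"}],
)
draft = reply.choices[0].message.content

# Re-insert the original values before the draft goes back to the customer.
for entity in scrubbed["entities_found"]:
    draft = draft.replace(entity["replacement"], entity["original"])
print(draft)
```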
Pricing and cost at scale
| Tier | Scrubadub | PrivacyFilter |
|---|---|---|
| Free | Unlimited (self-hosted) | 3 redactions/day, up to 2,000 chars each |
| Low volume | Infra + maintenance costs | $9 for 50 credits (each = up to 10,000 chars) |
| Unlimited | Infra + maintenance costs | $19/month, unlimited, batch upload included |
| Enterprise | Self-managed (free licence) | Contact — not the primary target |
Running Scrubadub is free in licence terms, but not in total cost. A production deployment needs: a Python environment with spaCy models (the transformer model is ~500 MB), enough CPU/GPU to run inference at your required throughput, and ongoing maintenance when spaCy or scrubadub releases breaking changes. For a team processing thousands of documents per day, this infrastructure cost is easily justified. For a team processing dozens, the PrivacyFilter $19/month plan is almost certainly cheaper than the engineering time to maintain a self-hosted stack.
At the $9 credit pack level: 50 credits × 10,000 chars = 500,000 characters of text, which is roughly 75,000 words or around 300 typical support tickets. For occasional batch jobs, this is very cost-effective.
When to use Scrubadub
- Your data cannot leave your infrastructure (HIPAA, financial services, defence)
- You are processing high volumes of structured text with predictable PII patterns (log files, form submissions)
- You need custom detectors for organisation-specific identifiers (employee IDs, internal codes)
- You are already running a Python NLP stack and adding another dependency is no burden
- Latency is a hard constraint and you cannot afford a network round-trip per document
When to use PrivacyFilter
- You are scrubbing unstructured prose — support tickets, chat logs, free-form notes — where context-aware detection matters
- You need a clean JSON entity map to perform pseudonymization (re-insert original values into LLM output)
- You want to pre-process text before sending it to an LLM like GPT-4o or Claude — PrivacyFilter is purpose-built for this pipeline
- You do not want to manage spaCy model versions, GPU memory, or spaCy compatibility with other packages
- You need a web UI for non-developer team members (HR, legal, compliance) to do ad-hoc redactions
- Your volume is low to medium and the $19/month unlimited plan is cheaper than running your own infra
Try PrivacyFilter free — no account required.
Paste any text and get PII redacted in under 2 seconds, with a full entity map. Free tier: 3 redactions/day up to 2,000 characters. Open the tool →