How to scrub PII from logs — DevOps guide
PII leaks into logs via error messages, request bodies, and user-generated content. Scrub it at the logging layer (before it is written), not after. Fastest approach: add a log filter in your logging config. For unstructured log lines already stored, batch-process them with the PrivacyFilter API. Regex catches emails and IPs; you need AI-based detection for names and addresses.
Application logs are one of the most common places personal data ends up by accident. A user submits a form with their name and email, the backend throws a validation error, and the full request body gets logged — including every piece of PII in it. Multiply that by millions of requests, and your Elasticsearch cluster or S3 log bucket becomes a GDPR liability.
This guide covers: where PII hides in logs, how to prevent it from being written in the first place, and how to scrub existing logs.
Where PII typically appears in logs
- HTTP request bodies — form submissions, JSON payloads, API requests containing user data
- Error messages and stack traces — exception messages that echo user input or database values
- Query parameters — search terms, filter values, user IDs in URLs
- HTTP headers — Authorization tokens, X-User-Email, forwarded IP addresses
- Database query logs — slow query logs, audit logs showing interpolated values
- Message queue payloads — Kafka, RabbitMQ, SQS messages dumped to logs
- CI/CD logs — test fixtures, seed data, credentials in environment variable dumps
Real example: A Django app logs ValidationError at /signup/ {'email': ['john.doe@acme.com already exists']}. That email address is now in your log aggregator, retained for 90 days, accessible to your entire DevOps team, and possibly exported to a third-party monitoring vendor — without any GDPR basis.
Strategy 1: Prevent PII from being logged (best approach)
Scrubbing PII from logs after the fact is playing catch-up. The best approach is to intercept log messages before they're written and strip sensitive data at the logging layer.
Python: logging filter
import logging
import re
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')
PHONE_RE = re.compile(r'\+?\d[\d\s\-\(\)]{7,}\d')
CREDITCARD_RE = re.compile(r'\b(?:\d[ \-]?){13,16}\b')
class PIIScrubFilter(logging.Filter):
    def filter(self, record):
        msg = str(record.getMessage())
        msg = EMAIL_RE.sub('[EMAIL_REDACTED]', msg)
        msg = PHONE_RE.sub('[PHONE_REDACTED]', msg)
        msg = CREDITCARD_RE.sub('[CC_REDACTED]', msg)
        record.msg = msg
        record.args = ()  # getMessage() already interpolated the args
        return True

# Attach to every handler on the root logger — a filter added to the
# logger itself does NOT apply to records propagated up from child loggers
for handler in logging.getLogger().handlers:
    handler.addFilter(PIIScrubFilter())
For request body logging in Django/FastAPI, add a middleware (or, in FastAPI, a dependency) that strips fields by key name before logging:
# FastAPI — strip sensitive keys from request bodies before logging them
import json
import logging

from fastapi import Request

logger = logging.getLogger("app")

SENSITIVE_KEYS = {"email", "phone", "password", "ssn", "card_number", "name", "address"}

async def log_request(request: Request):
    body = await request.body()
    try:
        data = json.loads(body)
        scrubbed = {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else v
                    for k, v in data.items()}
        logger.info(f"POST {request.url.path} body={scrubbed}")
    except Exception:
        logger.info(f"POST {request.url.path} body=[non-JSON]")
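Note that the dict comprehension above only redacts top-level keys; nested JSON payloads need a recursive walk. A minimal sketch (the key set mirrors the SENSITIVE_KEYS above):

```python
SENSITIVE_KEYS = {"email", "phone", "password", "ssn", "card_number", "name", "address"}

def scrub(value):
    # Recursively redact sensitive keys in nested dicts and lists
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else scrub(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value
```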
Strategy 2: Scrub existing logs in bulk
If you already have stored log files containing PII, you need to process them in bulk. For structured log files (JSONL, CSV), use regex on known fields. For unstructured log lines with free-form text, use the PrivacyFilter API which handles contextual PII like names that regex misses.
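For the structured case, a short sketch that rewrites a JSONL log file, redacting known field names (the key set and redaction token here are illustrative):

```python
import json

SENSITIVE_KEYS = {"email", "phone", "user_name", "ip"}

def scrub_jsonl(input_path: str, output_path: str):
    # Rewrite a JSONL log file, redacting known sensitive fields per record
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            for key in record.keys() & SENSITIVE_KEYS:
                record[key] = "[REDACTED]"
            dst.write(json.dumps(record) + "\n")
```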
Scrub a log file with the PrivacyFilter API
import sys

import httpx

LICENSE_KEY = "your-uuid-here"

def scrub_log_file(input_path: str, output_path: str, batch_size: int = 20):
    with open(input_path) as f:
        lines = f.readlines()
    scrubbed = []
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i + batch_size]
        docs = [{"id": str(i + j), "text": line.strip()} for j, line in enumerate(batch)]
        resp = httpx.post(
            "https://privacyfilter.run/api/redact/batch",
            json={"documents": docs, "license_key": LICENSE_KEY, "mode": "mask"},
            timeout=60,
        )
        resp.raise_for_status()
        for result in resp.json()["results"]:
            scrubbed.append(result["redacted_text"] + "\n")
    with open(output_path, "w") as f:
        f.writelines(scrubbed)
    print(f"Scrubbed {len(lines)} lines → {output_path}")

# Usage: python scrub_logs.py app.log app.log.scrubbed
if __name__ == "__main__":
    scrub_log_file(sys.argv[1], sys.argv[2])
Use mode=mask (not replace) for log scrubbing — you want irreversible redaction, not traceable pseudonyms. Masking produces ████ which makes clear the data was removed.
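To see the difference locally (an illustration of the two behaviors, not the API's exact output): masking destroys the value, while replace-style pseudonymization keeps a stable token you could map back.

```python
import re

EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')

def mask(text: str) -> str:
    # Irreversible: overwrite every matched character with a block glyph
    return EMAIL_RE.sub(lambda m: '█' * len(m.group()), text)

def pseudonymize(text: str, table: dict) -> str:
    # Reversible: each distinct value maps to a stable numbered token
    def repl(m):
        return table.setdefault(m.group(), f"[EMAIL_{len(table) + 1}]")
    return EMAIL_RE.sub(repl, text)
```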
Log aggregator-specific approaches
Elasticsearch / OpenSearch
Use an ingest pipeline with a gsub processor to replace email patterns before indexing:
PUT _ingest/pipeline/pii-scrub
{
  "processors": [
    {
      "gsub": {
        "field": "message",
        "pattern": "\\b[A-Za-z0-9._%+\\-]+@[A-Za-z0-9.\\-]+\\.[A-Za-z]{2,}\\b",
        "replacement": "[EMAIL_REDACTED]"
      }
    }
  ]
}
Datadog
Use Sensitive Data Scanner in Datadog settings. It supports built-in rules for email, SSN, credit cards, and custom regex. Apply it at the log pipeline level so PII never reaches the indexed logs.
Splunk
Use the SEDCMD setting in props.conf to run a sed-like replacement on log data before indexing:
[source::/var/log/app/*.log]
SEDCMD-redact_email = s/[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}/[EMAIL_REDACTED]/g
Logstash / Fluent Bit
# Logstash: mutate filter with gsub
filter {
  mutate {
    gsub => [
      "message", "[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}", "[EMAIL_REDACTED]",
      "message", "\b\d{3}-\d{2}-\d{4}\b", "[SSN_REDACTED]"
    ]
  }
}
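Fluent Bit has no built-in gsub, but its lua filter can do the same replacement. A sketch (script name, match pattern, and the Lua email pattern are assumptions — Lua patterns are not PCRE, so this is an approximation):

```lua
-- scrub.lua — register via a Fluent Bit [FILTER] stanza, e.g.:
--   [FILTER]
--       Name    lua
--       Match   app.*
--       Script  scrub.lua
--       Call    scrub_pii
function scrub_pii(tag, timestamp, record)
    local msg = record["message"]
    if msg then
        -- Lua pattern (not PCRE): rough email match
        record["message"] = string.gsub(msg, "[%w%.%%+%-_]+@[%w%.%-]+%.%a%a+", "[EMAIL_REDACTED]")
    end
    -- return code 1 = record modified; keep the original timestamp
    return 1, timestamp, record
end
```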
GDPR and log retention requirements
Under GDPR, logs containing personal data must:
- Have a documented legal basis for processing (usually "legitimate interest" for security/debugging)
- Be retained only as long as necessary (most supervisory authorities recommend 30–90 days for operational logs)
- Be accessible only to personnel who need them
- Be included in your Records of Processing Activities (RoPA)
Scrubbing PII from logs reduces your exposure and can support arguments that logs no longer contain personal data — simplifying retention and access controls.
Regex coverage for common log PII
The following regex patterns cover the most common PII types in logs. They are not exhaustive — for unstructured text with names or addresses, you need AI-powered detection.
import re
# Email addresses
EMAIL = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')
# IPv4 — treated as personal data under GDPR when it can be linked to an individual
IPV4 = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
# IPv6
IPV6 = re.compile(r'\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b')
# SSN (US)
SSN = re.compile(r'\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b')
# Credit cards (Luhn-valid check not included here)
CREDITCARD = re.compile(r'\b(?:\d[ \-]?){13,16}\b')
# UK National Insurance number
NI_UK = re.compile(r'\b[A-Z]{2}\d{6}[A-D]\b')
PATTERNS = [EMAIL, IPV4, IPV6, SSN, CREDITCARD, NI_UK]

def regex_scrub(line: str) -> str:
    for pat in PATTERNS:
        line = pat.sub('[REDACTED]', line)
    return line
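A quick sanity check of the scrubber on a typical log line (two of the patterns repeated here so the snippet runs standalone):

```python
import re

EMAIL = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')
IPV4 = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def regex_scrub(line: str) -> str:
    for pat in (EMAIL, IPV4):
        line = pat.sub('[REDACTED]', line)
    return line

print(regex_scrub("auth failed for jane@example.com from 203.0.113.7"))
# → auth failed for [REDACTED] from [REDACTED]
```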
Need to scrub logs at scale? The PrivacyFilter batch API handles up to 20 log lines per request, with AI detection for names and addresses that regex misses.