How to scrub PII from logs — DevOps guide
PII leaks into logs via error messages, request bodies, and user-generated content. Scrub it at the logging layer (before it is written), not after. Fastest approach: add a log filter in your logging config. For unstructured log lines already stored, batch-process them with the PrivacyFilter API. Regex catches emails and IPs; you need AI-based detection for names and addresses.
Application logs are one of the most common places personal data ends up by accident. A user submits a form with their name and email, the backend throws a validation error, and the full request body gets logged — including every piece of PII in it. Multiply that by millions of requests, and your Elasticsearch cluster or S3 log bucket becomes a GDPR liability.
This guide covers: where PII hides in logs, how to prevent it from being written in the first place, and how to scrub existing logs.
Where PII typically appears in logs
- HTTP request bodies — form submissions, JSON payloads, API requests containing user data
- Error messages and stack traces — exception messages that echo user input or database values
- Query parameters — search terms, filter values, user IDs in URLs
- HTTP headers — Authorization tokens, X-User-Email, forwarded IP addresses
- Database query logs — slow query logs, audit logs showing interpolated values
- Message queue payloads — Kafka, RabbitMQ, SQS messages dumped to logs
- CI/CD logs — test fixtures, seed data, credentials in environment variable dumps
Real example: A Django app logs ValidationError at /signup/ {'email': ['john.doe@acme.com already exists']}. That email address is now in your log aggregator, retained for 90 days, accessible to your entire DevOps team, and possibly exported to a third-party monitoring vendor — without any GDPR basis.
Strategy 1: Prevent PII from being logged (best approach)
Scrubbing PII from logs after the fact is playing catch-up. The best approach is to intercept log messages before they're written and strip sensitive data at the logging layer.
Python: logging filter
import logging
import re
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')
PHONE_RE = re.compile(r'\+?\d[\d\s\-\(\)]{7,}\d')
CREDITCARD_RE = re.compile(r'\b(?:\d[ \-]?){13,16}\b')
class PIIScrubFilter(logging.Filter):
    def filter(self, record):
        msg = str(record.getMessage())
        msg = EMAIL_RE.sub('[EMAIL_REDACTED]', msg)
        msg = PHONE_RE.sub('[PHONE_REDACTED]', msg)
        msg = CREDITCARD_RE.sub('[CC_REDACTED]', msg)
        record.msg = msg
        record.args = ()  # getMessage() already interpolated the args
        return True

# Attach to every handler on the root logger — a filter added to the
# logger itself does NOT apply to records propagated up from child loggers
for handler in logging.getLogger().handlers:
    handler.addFilter(PIIScrubFilter())
For request body logging in Django/FastAPI, add a middleware (or, in FastAPI, a dependency) that strips fields by key name before logging:
# FastAPI — strip sensitive keys from request bodies before logging them
import json
import logging

from fastapi import Request

logger = logging.getLogger("app")

SENSITIVE_KEYS = {"email", "phone", "password", "ssn", "card_number", "name", "address"}

async def log_request(request: Request):
    body = await request.body()
    try:
        data = json.loads(body)
        scrubbed = {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else v
                    for k, v in data.items()}
        logger.info(f"POST {request.url.path} body={scrubbed}")
    except Exception:
        logger.info(f"POST {request.url.path} body=[non-JSON]")
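Note that the dict comprehension above only redacts top-level keys; nested JSON payloads need a recursive walk. A minimal sketch (the key set mirrors the SENSITIVE_KEYS above):

```python
SENSITIVE_KEYS = {"email", "phone", "password", "ssn", "card_number", "name", "address"}

def scrub(value):
    # Recursively redact sensitive keys in nested dicts and lists
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else scrub(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value
```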
Strategy 2: Scrub existing logs in bulk
If you already have stored log files containing PII, you need to process them in bulk. For structured log files (JSONL, CSV), use regex on known fields. For unstructured log lines with free-form text, use the PrivacyFilter API which handles contextual PII like names that regex misses.
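For the structured case, a short sketch that rewrites a JSONL log file, redacting known field names (the key set and redaction token here are illustrative):

```python
import json

SENSITIVE_KEYS = {"email", "phone", "user_name", "ip"}

def scrub_jsonl(input_path: str, output_path: str):
    # Rewrite a JSONL log file, redacting known sensitive fields per record
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            for key in record.keys() & SENSITIVE_KEYS:
                record[key] = "[REDACTED]"
            dst.write(json.dumps(record) + "\n")
```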
Scrub a log file with the PrivacyFilter API
import sys

import httpx

LICENSE_KEY = "your-uuid-here"

def scrub_log_file(input_path: str, output_path: str, batch_size: int = 20):
    with open(input_path) as f:
        lines = f.readlines()
    scrubbed = []
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i + batch_size]
        docs = [{"id": str(i + j), "text": line.strip()} for j, line in enumerate(batch)]
        resp = httpx.post(
            "https://privacyfilter.run/api/redact/batch",
            json={"documents": docs, "license_key": LICENSE_KEY, "mode": "mask"},
            timeout=60,
        )
        resp.raise_for_status()
        for result in resp.json()["results"]:
            scrubbed.append(result["redacted_text"] + "\n")
    with open(output_path, "w") as f:
        f.writelines(scrubbed)
    print(f"Scrubbed {len(lines)} lines → {output_path}")

# Usage: python scrub_logs.py app.log app.log.scrubbed
if __name__ == "__main__":
    scrub_log_file(sys.argv[1], sys.argv[2])
Use mode=mask (not replace) for log scrubbing — you want irreversible redaction, not traceable pseudonyms. Masking produces ████ which makes clear the data was removed.
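To see the difference locally (an illustration of the two behaviors, not the API's exact output): masking destroys the value, while replace-style pseudonymization keeps a stable token you could map back.

```python
import re

EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')

def mask(text: str) -> str:
    # Irreversible: overwrite every matched character with a block glyph
    return EMAIL_RE.sub(lambda m: '█' * len(m.group()), text)

def pseudonymize(text: str, table: dict) -> str:
    # Reversible: each distinct value maps to a stable numbered token
    def repl(m):
        return table.setdefault(m.group(), f"[EMAIL_{len(table) + 1}]")
    return EMAIL_RE.sub(repl, text)
```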
Log aggregator-specific approaches
Elasticsearch / OpenSearch
Use an ingest pipeline with a gsub processor to replace email patterns before indexing:
PUT _ingest/pipeline/pii-scrub
{
  "processors": [
    {
      "gsub": {
        "field": "message",
        "pattern": "\\b[A-Za-z0-9._%+\\-]+@[A-Za-z0-9.\\-]+\\.[A-Za-z]{2,}\\b",
        "replacement": "[EMAIL_REDACTED]"
      }
    }
  ]
}
Datadog
Use Sensitive Data Scanner in Datadog settings. It supports built-in rules for email, SSN, credit cards, and custom regex. Apply it at the log pipeline level so PII never reaches the indexed logs.
Splunk
Use the SEDCMD setting in props.conf to run a sed-like replacement on log data before indexing:
[source::/var/log/app/*.log]
SEDCMD-redact_email = s/[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}/[EMAIL_REDACTED]/g
Logstash / Fluent Bit
# Logstash: mutate filter with gsub
filter {
  mutate {
    gsub => [
      "message", "[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}", "[EMAIL_REDACTED]",
      "message", "\b\d{3}-\d{2}-\d{4}\b", "[SSN_REDACTED]"
    ]
  }
}
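Fluent Bit has no built-in gsub, but its lua filter can do the same replacement. A sketch (script name, match pattern, and the Lua email pattern are assumptions — Lua patterns are not PCRE, so this is an approximation):

```lua
-- scrub.lua — register via a Fluent Bit [FILTER] stanza, e.g.:
--   [FILTER]
--       Name    lua
--       Match   app.*
--       Script  scrub.lua
--       Call    scrub_pii
function scrub_pii(tag, timestamp, record)
    local msg = record["message"]
    if msg then
        -- Lua pattern (not PCRE): rough email match
        record["message"] = string.gsub(msg, "[%w%.%%+%-_]+@[%w%.%-]+%.%a%a+", "[EMAIL_REDACTED]")
    end
    -- return code 1 = record modified; keep the original timestamp
    return 1, timestamp, record
end
```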
GDPR and log retention requirements
Under GDPR, logs containing personal data must:
- Have a documented legal basis for processing (usually "legitimate interest" for security/debugging)
- Be retained only as long as necessary (most supervisory authorities recommend 30–90 days for operational logs)
- Be accessible only to personnel who need them
- Be included in your Records of Processing Activities (RoPA)
Scrubbing PII from logs reduces your exposure and can support arguments that logs no longer contain personal data — simplifying retention and access controls.
Regex coverage for common log PII
The following regex patterns cover the most common PII types in logs. They are not exhaustive — for unstructured text with names or addresses, you need AI-powered detection.
import re
# Email addresses
EMAIL = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')
# IPv4 — treated as personal data under GDPR when it can be linked to an individual
IPV4 = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
# IPv6
IPV6 = re.compile(r'\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b')
# SSN (US)
SSN = re.compile(r'\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b')
# Credit cards (Luhn-valid check not included here)
CREDITCARD = re.compile(r'\b(?:\d[ \-]?){13,16}\b')
# UK National Insurance number
NI_UK = re.compile(r'\b[A-Z]{2}\d{6}[A-D]\b')
PATTERNS = [EMAIL, IPV4, IPV6, SSN, CREDITCARD, NI_UK]

def regex_scrub(line: str) -> str:
    for pat in PATTERNS:
        line = pat.sub('[REDACTED]', line)
    return line
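A quick sanity check of the scrubber on a typical log line (two of the patterns repeated here so the snippet runs standalone):

```python
import re

EMAIL = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')
IPV4 = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def regex_scrub(line: str) -> str:
    for pat in (EMAIL, IPV4):
        line = pat.sub('[REDACTED]', line)
    return line

print(regex_scrub("auth failed for jane@example.com from 203.0.113.7"))
# → auth failed for [REDACTED] from [REDACTED]
```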
Need to scrub logs at scale? The PrivacyFilter batch API handles up to 20 log lines per request, with AI detection for names and addresses that regex misses.