
PII scrubbing

PII scrubbing ensures sensitive data is never sent to an LLM provider. Entities are detected in the prompt, replaced with deterministic placeholders, and then restored in the response before it reaches the client.

User prompt: "My name is Alice Smith, email alice@example.com"
↓ stage 05: scrub
To LLM: "My name is <<PII_PERSON_a3f8>>, email <<PII_EMAIL_ADDRESS_b1c2>>"
↓ LLM responds
From LLM: "Hello <<PII_PERSON_a3f8>>! I'll contact you at <<PII_EMAIL_ADDRESS_b1c2>>"
↓ stage 09: restore
Client gets: "Hello Alice Smith! I'll contact you at alice@example.com"

The placeholder <<PII_ENTITY_TYPE_hash>> is:

  • Deterministic — same input value always produces the same placeholder within a request
  • Reversible — the mapping is stored in request context and used to restore values in the response
  • Opaque — the hash is a truncated SHA-256 of the original value; the original cannot be derived from the placeholder
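The three properties above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names, the 4-character hex truncation (inferred from examples such as a3f8), and the use of plain string replacement instead of detector offsets are all assumptions made for illustration.

```python
import hashlib


def make_placeholder(entity_type: str, value: str, hash_len: int = 4) -> str:
    """Build a placeholder such as <<PII_PERSON_xxxx>>.

    hash_len=4 is an assumption based on the examples in this page.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:hash_len]
    return f"<<PII_{entity_type}_{digest}>>"


def scrub(text: str, detections: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Replace each detected value and return the scrubbed text plus the mapping.

    `detections` is a hypothetical list of (entity_type, value) pairs; a real
    detector would return character offsets rather than raw strings.
    """
    mapping: dict[str, str] = {}
    for entity_type, value in detections:
        placeholder = make_placeholder(entity_type, value)
        mapping[placeholder] = value  # kept per request so the response can be restored
        text = text.replace(value, placeholder)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap every placeholder that appears verbatim back to its original value."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


prompt = "My name is Alice Smith, email alice@example.com"
detections = [("PERSON", "Alice Smith"), ("EMAIL_ADDRESS", "alice@example.com")]

scrubbed, mapping = scrub(prompt, detections)
reply = f"Hello {make_placeholder('PERSON', 'Alice Smith')}!"
print(restore(reply, mapping))  # Hello Alice Smith!
```

Because the placeholder depends only on the entity type and the value, repeated mentions of the same value map to the same token, and the mapping held in request context is enough to reverse the substitution.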

Entity types are configured in config.yaml under pii.entities:

Entity          Examples
PERSON          Alice Smith, Dr. Johnson
EMAIL_ADDRESS   alice@example.com
PHONE_NUMBER    +1-555-867-5309
CREDIT_CARD     4111 1111 1111 1111
US_SSN          123-45-6789
IP_ADDRESS      192.168.1.1
LOCATION        221B Baker Street, London

Add or remove entity types in config.yaml:

pii:
  entities:
    - PERSON
    - EMAIL_ADDRESS
    - PHONE_NUMBER
    - CREDIT_CARD
    - US_SSN
    - IP_ADDRESS
    - LOCATION

pii.score_threshold (default 0.7) controls Presidio’s minimum confidence before an entity is redacted. Lower values catch more entities but increase false positives.

pii:
  score_threshold: 0.7
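To make the trade-off concrete, the sketch below filters a handful of candidate detections by confidence. The Detection class, the apply_threshold helper, and the scores are hypothetical and chosen only for illustration; the sketch also assumes that detections scoring at or above the threshold are kept.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one detector result."""
    entity_type: str
    text: str
    score: float


def apply_threshold(detections: list[Detection], score_threshold: float = 0.7) -> list[Detection]:
    """Keep only detections whose confidence meets the threshold (assumed inclusive)."""
    return [d for d in detections if d.score >= score_threshold]


# Illustrative scores, not real detector output.
candidates = [
    Detection("EMAIL_ADDRESS", "alice@example.com", 1.0),
    Detection("PERSON", "Alice Smith", 0.85),
    Detection("LOCATION", "Baker", 0.55),
]

print([d.entity_type for d in apply_threshold(candidates, 0.7)])
# ['EMAIL_ADDRESS', 'PERSON']
print([d.entity_type for d in apply_threshold(candidates, 0.5)])
# ['EMAIL_ADDRESS', 'PERSON', 'LOCATION']
```

With these made-up scores, lowering the threshold from 0.7 to 0.5 also redacts the low-confidence LOCATION match, which is where extra false positives would come from.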

Detection uses Microsoft Presidio with a spaCy en_core_web_lg NER backend. The spaCy model is loaded on startup (it’s ~800 MB — this is why probes have a 60-second initial delay).

To disable PII scrubbing, set pii.enabled to false:

pii:
  enabled: false

The scrubber still initialises (keeping startup time the same), but all requests pass through unchanged.

Known limitations:

  • English language only (spaCy en_core_web_lg)
  • Does not scrub binary data or file uploads
  • Cannot restore PII if the LLM paraphrases the placeholder (e.g. “the person mentioned earlier”) rather than echoing it verbatim
  • Context-dependent entities (e.g. a company name that is also a common word) may be missed at threshold 0.7