PII scrubbing
PII scrubbing keeps detected sensitive data out of requests to an LLM provider. Entities are detected in the prompt, replaced with deterministic placeholders, and restored in the response before it reaches the client.
How it works
    User prompt: "My name is Alice Smith, email alice@example.com"
        ↓ stage 05: scrub
    To LLM:      "My name is <<PII_PERSON_a3f8>>, email <<PII_EMAIL_ADDRESS_b1c2>>"
        ↓ LLM responds
    From LLM:    "Hello <<PII_PERSON_a3f8>>! I'll contact you at <<PII_EMAIL_ADDRESS_b1c2>>"
        ↓ stage 09: restore
    Client gets: "Hello Alice Smith! I'll contact you at alice@example.com"

The placeholder <<PII_ENTITY_TYPE_hash>> is:
- Deterministic — same input value always produces the same placeholder within a request
- Reversible — the mapping is stored in request context and used to restore values in the response
- Opaque — the hash is a truncated SHA-256 of the original value; the original cannot be derived from the placeholder
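The three properties above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names `make_placeholder`, `scrub`, and `restore` are hypothetical, the 4-character hash length is inferred from the example placeholders, and detections are assumed to arrive as `(entity_type, value)` pairs.

```python
import hashlib


def make_placeholder(entity_type: str, value: str, hash_len: int = 4) -> str:
    """Return <<PII_{entity_type}_{hash}>> using a truncated SHA-256 of the value."""
    # hash_len=4 is an assumption taken from the example placeholders.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:hash_len]
    return f"<<PII_{entity_type}_{digest}>>"


def scrub(text: str, detections: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Replace detected values with placeholders; return the text and the mapping."""
    mapping: dict[str, str] = {}
    # Longest values first, so a value contained in another is not replaced early.
    for entity_type, value in sorted(detections, key=lambda d: len(d[1]), reverse=True):
        placeholder = make_placeholder(entity_type, value)
        mapping[placeholder] = value
        text = text.replace(value, placeholder)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap every placeholder that appears verbatim back to its original value."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


prompt = "My name is Alice Smith, email alice@example.com"
detections = [("PERSON", "Alice Smith"), ("EMAIL_ADDRESS", "alice@example.com")]

scrubbed, mapping = scrub(prompt, detections)
round_trip = restore(scrubbed, mapping)  # equals the original prompt
```

Because the placeholder depends only on the entity type and the value, a value that appears twice in one request maps to a single placeholder. With only four hex characters, two different values could in principle collide, so a real implementation would need to handle that case.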
Detected entities
Configured in config.yaml under pii.entities:

| Entity | Examples |
|---|---|
| PERSON | Alice Smith, Dr. Johnson |
| EMAIL_ADDRESS | alice@example.com |
| PHONE_NUMBER | +1-555-867-5309 |
| CREDIT_CARD | 4111 1111 1111 1111 |
| US_SSN | 123-45-6789 |
| IP_ADDRESS | 192.168.1.1 |
| LOCATION | 221B Baker Street, London |
Add or remove entity types in config.yaml:
    pii:
      entities:
        - PERSON
        - EMAIL_ADDRESS
        - PHONE_NUMBER
        - CREDIT_CARD
        - US_SSN
        - IP_ADDRESS
        - LOCATION

Score threshold
pii.score_threshold (default 0.7) controls Presidio’s minimum confidence before an entity is redacted. Lower values catch more entities but increase false positives.
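As a rough illustration of that trade-off, the sketch below filters candidate detections by confidence. The `Detection` class, the `apply_threshold` helper, and the scores and offsets are all made up for the example, and the inclusive comparison (`>=`) is an assumption about how the cut-off is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int    # character offset where the entity begins
    end: int      # character offset just past the entity
    score: float  # detector confidence in [0.0, 1.0]


def apply_threshold(detections: list[Detection], score_threshold: float = 0.7) -> list[Detection]:
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d.score >= score_threshold]


# Illustrative candidates; the scores are invented, not real detector output.
candidates = [
    Detection("EMAIL_ADDRESS", 30, 47, 1.0),
    Detection("PERSON", 11, 22, 0.85),
    Detection("LOCATION", 50, 56, 0.55),
]

strict = apply_threshold(candidates, 0.7)   # drops the low-confidence LOCATION
lenient = apply_threshold(candidates, 0.5)  # keeps all three candidates
```

At 0.7 the uncertain LOCATION candidate is left in the prompt unredacted; at 0.5 it is redacted, along with any other low-confidence matches that may not be PII at all.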
    pii:
      score_threshold: 0.7

Engine
Detection uses Microsoft Presidio with a spaCy en_core_web_lg NER backend. The spaCy model (~800 MB) is loaded on startup, which is why probes have a 60-second initial delay.
Disabling
    pii:
      enabled: false

The scrubber still initialises (keeping startup time the same), but all requests pass through unchanged.
Limitations
- English language only (spaCy en_core_web_lg)
- Does not scrub binary data or file uploads
- Cannot restore PII if the LLM paraphrases the placeholder (e.g. “the person mentioned earlier”) rather than echoing it verbatim
- Context-dependent entities (e.g. a company name that is also a common word) may be missed at threshold 0.7
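The verbatim-echo limitation can be made visible by checking which placeholders actually came back in the response. The sketch below is one possible approach, not a feature of the product: `restore_with_report` and the regular expression are hypothetical, and the mapping reuses the placeholder values from the example in "How it works".

```python
import re

# Matches placeholders of the form <<PII_ENTITY_TYPE_hash>> (pattern is an assumption).
PLACEHOLDER_RE = re.compile(r"<<PII_[A-Z_]+_[0-9a-f]+>>")


def restore_with_report(response: str, mapping: dict[str, str]) -> tuple[str, set[str]]:
    """Restore verbatim placeholders and report mapped ones that never appeared."""
    seen = set(PLACEHOLDER_RE.findall(response))
    # Unknown placeholders are left untouched rather than dropped.
    restored = PLACEHOLDER_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), response)
    missing = set(mapping) - seen
    return restored, missing


mapping = {
    "<<PII_PERSON_a3f8>>": "Alice Smith",
    "<<PII_EMAIL_ADDRESS_b1c2>>": "alice@example.com",
}
response = "I'll email the person mentioned earlier at <<PII_EMAIL_ADDRESS_b1c2>>."

restored, missing = restore_with_report(response, mapping)
```

Here the email address is restored, while the person's name stays paraphrased and its placeholder is reported in `missing`. A missing placeholder does not prove a paraphrase, since the model may simply not have referred to that entity, so the report is best treated as a signal for logging rather than an error.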