Content policy
Content policy runs at stage 02 — before rate limiting, PII scrubbing, and the LLM call. It catches obvious attacks immediately so you never pay for bad tokens.
Blocked patterns
Section titled “Blocked patterns”A case-insensitive substring match is applied to the full concatenated prompt text (all messages joined). If any pattern is found, the request is rejected.
Default patterns:
content_policy: blocked_patterns: - "ignore previous instructions" - "ignore all previous" - "jailbreak"Add your own:
content_policy: blocked_patterns: - "ignore previous instructions" - "ignore all previous" - "jailbreak" - "DAN mode" - "act as if you have no restrictions" - "pretend you are an AI without guidelines" - "hypothetically speaking, you could"Helm:
config: contentPolicy: blockedPatterns: - "ignore previous instructions" - "jailbreak" - "DAN mode"Max input tokens
Section titled “Max input tokens”Large prompts can exhaust your budget quickly and are often a sign of context-stuffing attacks. Set a hard ceiling:
content_policy: max_input_tokens: 32000Tokens are counted with tiktoken before any LLM call. Requests over the limit are rejected with 400.
Response shape on block
Section titled “Response shape on block”HTTP/1.1 400 Bad RequestContent-Type: application/json
{ "error": { "type": "content_policy_violation", "message": "Request blocked by content policy.", "code": 400 }}The response intentionally does not reveal which pattern matched, to avoid helping attackers craft a bypass.
Disabling
Section titled “Disabling”content_policy: enabled: falseChoosing good patterns
Section titled “Choosing good patterns”Effective patterns are:
- Specific enough not to trigger on legitimate use (avoid generic words like “ignore” alone)
- Phrase-level — attackers can work around single-word blocks trivially
- Regularly reviewed — the threat landscape evolves; add patterns when you see new attack variants in your logs
For more sophisticated semantic-level content moderation, consider adding a Presidio ContentModeration check or a separate LLM-as-judge classifier alongside the pattern list.