Skip to content

Content policy

Content policy runs at stage 02 — before rate limiting, PII scrubbing, and the LLM call. It catches obvious attacks immediately so you never pay for bad tokens.

A case-insensitive substring match is applied to the full concatenated prompt text (all messages joined). If any pattern is found, the request is rejected.

Default patterns:

content_policy:
blocked_patterns:
- "ignore previous instructions"
- "ignore all previous"
- "jailbreak"

Add your own:

content_policy:
blocked_patterns:
- "ignore previous instructions"
- "ignore all previous"
- "jailbreak"
- "DAN mode"
- "act as if you have no restrictions"
- "pretend you are an AI without guidelines"
- "hypothetically speaking, you could"

Helm:

config:
contentPolicy:
blockedPatterns:
- "ignore previous instructions"
- "jailbreak"
- "DAN mode"

Large prompts can exhaust your budget quickly and are often a sign of context-stuffing attacks. Set a hard ceiling:

content_policy:
max_input_tokens: 32000

Tokens are counted with tiktoken before any LLM call. Requests over the limit are rejected with 400.

HTTP/1.1 400 Bad Request
Content-Type: application/json
{
"error": {
"type": "content_policy_violation",
"message": "Request blocked by content policy.",
"code": 400
}
}

The response intentionally does not reveal which pattern matched, to avoid helping attackers craft a bypass.

content_policy:
enabled: false

Effective patterns are:

  • Specific enough not to trigger on legitimate use (avoid generic words like “ignore” alone)
  • Phrase-level — attackers can work around single-word blocks trivially
  • Regularly reviewed — the threat landscape evolves; add patterns when you see new attack variants in your logs

For more sophisticated semantic-level content moderation, consider adding a Presidio ContentModeration check or a separate LLM-as-judge classifier alongside the pattern list.