Content policy

Content policy runs at stage 02 — before rate limiting, PII scrubbing, and the LLM call. It catches obvious attacks immediately so you never pay for bad tokens.

Blocked patterns

A case-insensitive substring match is applied to the full concatenated prompt text (all messages joined). If any pattern is found, the request is rejected.

Default patterns:

content_policy:
  blocked_patterns:
    - "ignore previous instructions"
    - "ignore all previous"
    - "jailbreak"

Add your own:

content_policy:
  blocked_patterns:
    - "ignore previous instructions"
    - "ignore all previous"
    - "jailbreak"
    - "DAN mode"
    - "act as if you have no restrictions"
    - "pretend you are an AI without guidelines"
    - "hypothetically speaking, you could"

Helm:

config:
  contentPolicy:
    blockedPatterns:
      - "ignore previous instructions"
      - "jailbreak"
      - "DAN mode"

Max input tokens

Large prompts can exhaust your budget quickly and are often a sign of context-stuffing attacks. Set a hard ceiling:

content_policy:
  max_input_tokens: 32000

Tokens are counted with tiktoken before any LLM call. Requests over the limit are rejected with 400.

Response shape on block

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "error": {
    "type": "content_policy_violation",
    "message": "Request blocked by content policy.",
    "code": 400
  }
}

The response intentionally does not reveal which pattern matched, to avoid helping attackers craft a bypass.

Disabling

content_policy:
  enabled: false

Choosing good patterns

Effective patterns are:

Specific enough not to trigger on legitimate use (avoid generic words like “ignore” alone)
Phrase-level — attackers can work around single-word blocks trivially
Regularly reviewed — the threat landscape evolves; add patterns when you see new attack variants in your logs

For more sophisticated semantic-level content moderation, consider adding a Presidio ContentModeration check or a separate LLM-as-judge classifier alongside the pattern list.