Request pipeline
Every request to /v1/chat/completions or /v1/messages passes through nine stages in order. Each stage can independently reject, transform, or short-circuit the request.
Stage overview
Section titled “Stage overview”Client request │ ▼┌─────────────────────────┐│ 01 Authentication │ Identify user, resolve team└─────────────┬───────────┘ │ ▼┌─────────────────────────┐│ 02 Content Policy │ Block patterns, check token count└─────────────┬───────────┘ │ ▼┌─────────────────────────┐│ 03 Token Count │ Count prompt tokens, enforce model limit└─────────────┬───────────┘ │ ▼┌─────────────────────────┐│ 04 Rate Limiting │ req/min, tokens/min, tokens/day└─────────────┬───────────┘ │ ▼┌─────────────────────────┐│ 05 PII Scrubbing │ Detect & replace sensitive entities└─────────────┬───────────┘ │ ▼┌─────────────────────────┐│ 06 RAG Context │ Semantic search, inject chunks└─────────────┬───────────┘ │ ▼┌─────────────────────────┐│ 07 Cache Lookup │ Return cached response if hit└─────────────┬───────────┘ │ (miss) ▼┌─────────────────────────┐│ 08 LLM Call │ Route via LiteLLM, fallback models└─────────────┬───────────┘ │ ▼┌─────────────────────────┐│ 09 Metrics & Usage │ Record to DB, restore PII, emit metrics└─────────────────────────┘ │ ▼ Client responseStage details
Section titled “Stage details”01 — Authentication
Section titled “01 — Authentication”- Extracts API key from
Authorization: Bearerorx-api-keyheader - Looks up key hash in the database (SHA-256 comparison)
- Resolves associated user and team
- Attaches user/team context to the request for downstream stages
- Rejects with 401 if key is missing, unknown, or revoked
02 — Content Policy
Section titled “02 — Content Policy”- Checks the concatenated prompt text against
content_policy.blocked_patterns(case-insensitive literal match) - Rejects with 400 (
content_policy_violation) if any pattern matches - Runs before token counting to fail fast on obvious attacks
- Disabled by setting
content_policy.enabled: false
03 — Token Count
Section titled “03 — Token Count”- Counts prompt tokens using
tiktoken(model-appropriate encoding) - Enforces
content_policy.max_input_tokens(default 32 000) - Stores the count for stage 04 (rate limiting deducts from buckets)
- Rejects with 400 if the prompt exceeds the token limit
04 — Rate Limiting
Section titled “04 — Rate Limiting”Three token buckets checked in order, any can reject:
- User req/min —
rate_limiting.defaults.requests_per_minute - User tokens/min —
rate_limiting.defaults.tokens_per_minute - User tokens/day —
rate_limiting.defaults.tokens_per_day - Team tokens/min — team’s
tpm_limit(if team has override)
Rejects with 429 and Retry-After header on any overflow.
See Rate limiting for bucket mechanics and Redis backend.
05 — PII Scrubbing
Section titled “05 — PII Scrubbing”- Runs Presidio
AnalyzerEngineacross all message content - Detected entities are replaced with deterministic placeholders:
<<PII_EMAIL_ADDRESS_a3f8c1d0>> - The placeholder→original mapping is stored in request context for stage 09
- Disabled by setting
pii.enabled: false
See PII scrubbing.
06 — RAG Context
Section titled “06 — RAG Context”- Embeds the last user message with
all-MiniLM-L6-v2 - Queries ChromaDB for top-k chunks above
score_threshold - Injects retrieved chunks as a prefix in the system message
- No-op if ChromaDB is empty or if
rag.enabled: false
See RAG integration.
07 — Cache Lookup
Section titled “07 — Cache Lookup”- Hashes the (normalized messages + model) to a cache key
- Returns the cached response immediately on hit — stages 08–09 are skipped
- Disabled by setting
cache.enabled: false(default)
08 — LLM Call
Section titled “08 — LLM Call”- Routes to the provider via LiteLLM based on the model name prefix
- On provider error (5xx, timeout): tries
fallback_modelsin order - Supports streaming (SSE) pass-through for both OpenAI and Anthropic formats
09 — Metrics & Usage
Section titled “09 — Metrics & Usage”- Counts completion tokens from the response
- Writes a
UsageRecordto PostgreSQL (user, team, model, prompt tokens, completion tokens, cost, latency) - Restores PII placeholders in the response content (reverse of stage 05)
- Increments Prometheus counters
- Stores response in cache if
cache.enabled: true(non-streaming only)
Skipping stages
Section titled “Skipping stages”Each non-authentication stage can be disabled in config.yaml:
rag: enabled: falsepii: enabled: falsecontent_policy: enabled: falserate_limiting: enabled: falsecache: enabled: falseAuthentication and metrics recording cannot be disabled.