Health & metrics
GET /healthz
Liveness probe. Returns 200 once the application has started.
```bash
curl http://localhost:8000/healthz
# {"status":"ok"}
```

Used by the Kubernetes `livenessProbe`. The default `initialDelaySeconds: 60` gives the spaCy `en_core_web_lg` model (~800 MB) time to load before the first probe.
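The liveness contract is simple enough to script, for example in a deployment smoke test that waits for the proxy to come up. The sketch below polls `/healthz` until it returns 200 or a deadline passes. `wait_for_healthy` and its defaults are hypothetical and not part of the proxy. The 90-second default is only chosen to leave headroom over the 60-second `initialDelaySeconds` mentioned above.

```python
import http.client
import time
import urllib.request


def wait_for_healthy(url: str, timeout: float = 90.0, interval: float = 2.0) -> bool:
    """Poll a liveness URL until it returns HTTP 200 or `timeout` seconds pass.

    Hypothetical helper for smoke tests; not part of the proxy itself.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, HTTP error status (HTTPError is an OSError),
            # or a slow/garbled response: the app is not up yet, so retry.
            pass
        time.sleep(interval)
    return False
```

For example, `wait_for_healthy("http://localhost:8000/healthz")` blocks until the pod has finished loading its models, then returns `True`.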
GET /readyz
Readiness probe. Returns 200 only when all dependencies (database, ChromaDB) are reachable.
```bash
curl http://localhost:8000/readyz
# {"status":"ok"}
```

Used by the Kubernetes `readinessProbe`. Traffic is not routed to a pod until this endpoint returns 200.
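A minimal sketch of the readiness rule described above. It assumes each dependency has a cheap reachability check that returns a boolean. The `readiness` function, the 503 status for the not-ready case, and the `failed` field are assumptions for illustration. The page itself only specifies the 200 / `{"status":"ok"}` success response.

```python
from typing import Callable, Dict, List, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body) for a readiness probe.

    Hypothetical sketch: every check must pass for a 200. A check that
    raises (e.g. a connection error) counts as an unreachable dependency.
    """
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # treat any error as "dependency not reachable"
        if not ok:
            failed.append(name)
    if failed:
        return 503, {"status": "unavailable", "failed": failed}
    return 200, {"status": "ok"}
```

A handler would call something like `readiness({"database": ping_db, "chromadb": ping_chroma})`, where `ping_db` and `ping_chroma` are hypothetical helpers. Kubernetes counts any status from 200 to 399 as success for an HTTP probe, so returning a 5xx code is enough to keep traffic away from the pod.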
GET /metrics
Exposes metrics in the Prometheus text exposition format.
```bash
curl http://localhost:8000/metrics
```

Key metrics exposed:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `relay_requests_total` | Counter | `model`, `status` | Total inference requests |
| `relay_request_duration_seconds` | Histogram | `model` | End-to-end request latency |
| `relay_tokens_total` | Counter | `model`, `type` | Tokens consumed (prompt/completion) |
| `relay_rate_limit_hits_total` | Counter | `limit_type` | Rate limit rejections |
| `relay_cache_hits_total` | Counter | (none) | Cache hits |
| `relay_pii_entities_total` | Counter | `entity_type` | PII entities scrubbed |
| `relay_content_policy_blocks_total` | Counter | (none) | Content policy rejections |
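To read these metrics outside Prometheus, for example in a smoke test, you can parse the text format with a few lines of standard-library Python. This is a simplified sketch. It ignores timestamps and exemplars and leaves label escape sequences unprocessed. The helper names and any label values such as `status="200"` are assumptions. Because `relay_request_duration_seconds` is a histogram, Prometheus exposes it as `_bucket`, `_sum` and `_count` series, so mean latency is `_sum / _count`.

```python
import re
from collections import defaultdict
from typing import Dict, Iterator, Tuple

# metric name, optional {label set}, value; anything after the value is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# key="value" pairs; escape sequences inside values are left as-is.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (name, labels, value) for each sample line in a /metrics body."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        m = _SAMPLE.match(line)
        if m is None:
            continue
        name, raw_labels, value = m.groups()
        yield name, dict(_LABEL.findall(raw_labels or "")), float(value)


def requests_by_status(text: str) -> Dict[str, float]:
    """Sum relay_requests_total across models, grouped by the status label."""
    totals: Dict[str, float] = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "relay_requests_total":
            totals[labels.get("status", "")] += value
    return dict(totals)


def mean_latency(text: str, model: str) -> float:
    """Mean end-to-end latency in seconds for one model (NaN if no requests)."""
    total = count = 0.0
    for name, labels, value in parse_samples(text):
        if labels.get("model") != model:
            continue
        if name == "relay_request_duration_seconds_sum":
            total += value
        elif name == "relay_request_duration_seconds_count":
            count += value
    return total / count if count else float("nan")
```

Feed either helper the body returned by `curl http://localhost:8000/metrics`. Note that these are cumulative totals since process start, not rates. Use PromQL `rate()` in Prometheus for per-second figures.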
Kubernetes ServiceMonitor
Enable automatic scraping with the Prometheus Operator:
```yaml
prometheus:
  serviceMonitor:
    enabled: true
    interval: "15s"
    scrapeTimeout: "10s"
    labels:
      release: prometheus  # match your Prometheus Operator release label
```

Manual scrape config
If you are not using the Prometheus Operator, add a static scrape job to your Prometheus configuration:

```yaml
scrape_configs:
  - job_name: llm-proxy
    static_configs:
      - targets: ["proxy.internal:8000"]
    metrics_path: /metrics
```
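Whichever scraping method you use, Prometheus rejects a scrape job whose timeout is longer than its scrape interval. The ServiceMonitor example above stays within this limit with `interval: "15s"` and `scrapeTimeout: "10s"`. The manual job sets neither value and so inherits the global defaults (`scrape_interval: 1m`, `scrape_timeout: 10s`). If you template these values, a pre-deploy check can catch a bad pair early. The sketch below is hypothetical (`parse_duration` and `timeout_fits` are not part of any tool). It handles the duration unit suffixes Prometheus uses (`ms`, `s`, `m`, `h`, `d`, `w`, `y`), including combined forms such as `1m30s`.

```python
import re

# Prometheus duration units expressed in integer milliseconds (y = 365 days).
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000,
            "d": 86_400_000, "w": 604_800_000, "y": 31_536_000_000}
# "ms" is listed before "m" and "s" so "250ms" is not read as "250m" + "s".
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text: str) -> float:
    """Convert a duration like "15s" or "1m30s" to seconds; raise on bad input."""
    pos, total_ms = 0, 0
    for m in _PART.finditer(text):
        if m.start() != pos:  # unparsed characters between parts
            raise ValueError(f"invalid duration: {text!r}")
        total_ms += int(m.group(1)) * _UNIT_MS[m.group(2)]
        pos = m.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total_ms / 1000


def timeout_fits(interval: str, scrape_timeout: str) -> bool:
    """True if the scrape timeout does not exceed the scrape interval."""
    return parse_duration(scrape_timeout) <= parse_duration(interval)
```

For example, `timeout_fits("15s", "10s")` accepts the ServiceMonitor values above, while swapping them would be rejected.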