Health & metrics

Liveness probe (/healthz). Returns 200 once the application has started.

curl http://localhost:8000/healthz
# {"status":"ok"}

Used by the Kubernetes livenessProbe. The default initialDelaySeconds: 60 allows time to load the spaCy en_core_web_lg model (~800 MB) at startup.

Readiness probe (/readyz). Returns 200 only when all dependencies (database, ChromaDB) are reachable.

curl http://localhost:8000/readyz
# {"status":"ok"}

Used by Kubernetes readinessProbe. Traffic is not routed to a pod until this returns 200.
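
Outside Kubernetes (for example in a CI job or a deploy script), the same readiness check can be polled by hand. The following is a minimal sketch, not part of the proxy itself: the function names, the 120-second deadline, and the 2-second interval are illustrative assumptions, and any non-200 response or connection error is treated as "not ready" because the exact failure status is not stated above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx responses (e.g. a 5xx) are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def wait_until_ready(
    url: str = "http://localhost:8000/readyz",
    deadline_s: float = 120.0,   # assumed budget, tune for your environment
    interval_s: float = 2.0,     # assumed polling interval
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns 200 or the sleep budget is used up.

    Only time spent sleeping counts toward deadline_s; time spent inside
    each request is not included, so the real wall-clock wait can be longer.
    """
    waited = 0.0
    while True:
        if fetch(url) == 200:
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A deploy script could call wait_until_ready() and abort when it returns False. The fetch and sleep parameters exist only so the loop can be exercised without a running server.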

Metrics endpoint (/metrics). Exposes metrics in the Prometheus text format.

curl http://localhost:8000/metrics

Key metrics exposed:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| relay_requests_total | Counter | model, status | Total inference requests |
| relay_request_duration_seconds | Histogram | model | End-to-end request latency |
| relay_tokens_total | Counter | model, type | Tokens consumed (prompt/completion) |
| relay_rate_limit_hits_total | Counter | limit_type | Rate limit rejections |
| relay_cache_hits_total | Counter | (none) | Cache hits |
| relay_pii_entities_total | Counter | entity_type | PII entities scrubbed |
| relay_content_policy_blocks_total | Counter | (none) | Content policy rejections |
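
For a quick look at these counters without a Prometheus server, the /metrics output can be aggregated directly. The sketch below sums one metric's samples grouped by a single label. It is a deliberately simplified parser for the text format, not a replacement for a real client library, and the sample payload (model names and values) is invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches sample lines such as: relay_tokens_total{model="m",type="prompt"} 42
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label pair inside the braces, allowing escaped characters.
_LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def sum_by_label(text: str, metric: str, label: str) -> Dict[str, float]:
    """Sum the samples of `metric`, grouped by the value of `label`.

    Samples that lack the label are grouped under the empty string.
    """
    totals: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(_LABEL_PAIR.findall(match.group(2) or ""))
        totals[labels.get(label, "")] += float(match.group(3))
    return dict(totals)


# Hypothetical payload: the model names and numbers are made up.
SAMPLE = '''\
# HELP relay_tokens_total Tokens consumed (prompt/completion)
# TYPE relay_tokens_total counter
relay_tokens_total{model="model-a",type="prompt"} 1200
relay_tokens_total{model="model-a",type="completion"} 300
relay_tokens_total{model="model-b",type="prompt"} 800
relay_requests_total{model="model-a",status="200"} 40
'''

tokens_by_type = sum_by_label(SAMPLE, "relay_tokens_total", "type")
# tokens_by_type == {"prompt": 2000.0, "completion": 300.0}
```

In practice the text argument would be the body returned by curl http://localhost:8000/metrics. For anything beyond ad hoc inspection, querying Prometheus itself is likely the better choice, since counters reset on restart and a single scrape shows only the current totals.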

Enable automatic scraping with Prometheus Operator:

values.yaml
prometheus:
  serviceMonitor:
    enabled: true
    interval: "15s"
    scrapeTimeout: "10s"
    labels:
      release: prometheus # match your Prometheus Operator release label

Without the Operator, add a static scrape job to your Prometheus configuration:

prometheus.yml
scrape_configs:
  - job_name: llm-proxy
    static_configs:
      - targets: ["proxy.internal:8000"]
    metrics_path: /metrics