Observability
Prometheus
Manual scrape config

    scrape_configs:
      - job_name: llm-proxy
        static_configs:
          - targets: ["proxy.internal:8000"]
        metrics_path: /metrics
        scrape_interval: 15s

Kubernetes ServiceMonitor (Prometheus Operator)

    prometheus:
      serviceMonitor:
        enabled: true
        interval: "15s"
        scrapeTimeout: "10s"
        labels:
          release: prometheus  # must match your Prometheus Operator's serviceMonitorSelector

Key metrics
| Metric | Type | Labels |
|---|---|---|
| relay_requests_total | Counter | model, status |
| relay_request_duration_seconds | Histogram | model |
| relay_tokens_total | Counter | model, type (prompt/completion) |
| relay_rate_limit_hits_total | Counter | limit_type |
| relay_cache_hits_total | Counter | (none) |
| relay_pii_entities_total | Counter | entity_type |
| relay_content_policy_blocks_total | Counter | (none) |
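
To sanity-check these metrics without a Prometheus server, the text served at /metrics can be parsed directly. The following is a minimal sketch, assuming the endpoint emits the standard Prometheus text exposition format; the sample values and the `parse_samples` / `sum_by` helpers are hypothetical and only illustrate how the labels in the table group the samples. It ignores optional timestamps and does not unescape label values, so a maintained parser such as `prometheus_client.parser.text_string_to_metric_families` is a better fit for real use.

```python
import re
from collections import defaultdict

# Hypothetical sample of /metrics output; real values and help text will differ.
SAMPLE = """\
# HELP relay_tokens_total Tokens processed
# TYPE relay_tokens_total counter
relay_tokens_total{model="gpt-4o",type="prompt"} 142
relay_tokens_total{model="gpt-4o",type="completion"} 87
relay_tokens_total{model="gpt-4o-mini",type="prompt"} 58
relay_cache_hits_total 3
"""

# metric_name, optional {labels}, then a single value (no timestamp support).
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# key="value" pairs; escaped characters are matched but not unescaped.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # unsupported line shape, e.g. one carrying a timestamp
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def sum_by(text, metric, label):
    """Sum a metric's samples grouped by one label (missing label -> "")."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


print(sum_by(SAMPLE, "relay_tokens_total", "type"))
```

Grouping by `type` here mirrors what `sum by (type) (...)` does in PromQL, minus the `rate()` step, which needs two scrapes taken at different times.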
Grafana
Suggested dashboard panels
- Request rate (requests/sec by model)

      sum by (model) (rate(relay_requests_total[5m]))

- Error rate (non-200 responses/sec by status)

      sum by (status) (rate(relay_requests_total{status!="200"}[5m]))

- Latency p50 / p95 / p99 (p95 shown; use 0.5 or 0.99 for the other quantiles)

      histogram_quantile(0.95, sum by (le) (rate(relay_request_duration_seconds_bucket[5m])))

- Token throughput

      sum by (type) (rate(relay_tokens_total[5m]))

- Rate limit hit rate

      sum by (limit_type) (rate(relay_rate_limit_hits_total[5m]))

- Cache hit ratio (both sides are summed so that the unlabeled cache counter can be divided by the labeled request counter)

      sum(rate(relay_cache_hits_total[5m])) / sum(rate(relay_requests_total[5m]))

- PII entities scrubbed

      sum by (entity_type) (rate(relay_pii_entities_total[5m]))
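
The latency panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below is a simplified Python version of the linear interpolation that Prometheus documents for classic histograms; it is not the actual implementation, and the bucket bounds and counts are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    like the `le` series of a classic Prometheus histogram.
    """
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bound, so that
                # bound is the best available answer.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for relay_request_duration_seconds_bucket.
BUCKETS = [(0.5, 20), (1.0, 60), (2.5, 90), (5.0, 98), (10.0, 100), (math.inf, 100)]

print(f"p50 = {histogram_quantile(0.50, BUCKETS):.4f}s")
print(f"p95 = {histogram_quantile(0.95, BUCKETS):.4f}s")
```

With these counts the p95 estimate lands inside the 2.5 to 5 second bucket, at about 4.06 seconds. The estimate can never be more precise than the bucket boundaries, which is worth keeping in mind when comparing a p95 panel against a fixed latency threshold.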
Recommended alerts

    groups:
      - name: llm-proxy
        rules:
          - alert: HighErrorRate
            # per-second rate of 5xx responses for each series (not a percentage)
            expr: rate(relay_requests_total{status=~"5.."}[5m]) > 0.05
            for: 5m
            annotations:
              summary: "High upstream error rate"

          - alert: HighLatency
            expr: histogram_quantile(0.95, sum by (le) (rate(relay_request_duration_seconds_bucket[5m]))) > 10
            for: 10m
            annotations:
              summary: "p95 latency over 10s"

          - alert: RateLimitSpike
            expr: sum(rate(relay_rate_limit_hits_total[5m])) > 5
            for: 5m
            annotations:
              summary: "Elevated rate limiting: check user quotas"

Structured logging
Enable JSON logging for log aggregation (Loki, CloudWatch, Datadog):

    server:
      log_level: info  # JSON format emitted automatically when LOG_FORMAT=json env var is set

Each request logs:

    {
      "timestamp": "2025-01-01T00:00:00Z",
      "level": "info",
      "request_id": "req_01j...",
      "user_id": "user_01j...",
      "team_id": "team_01j...",
      "model": "gpt-4o",
      "prompt_tokens": 142,
      "completion_tokens": 87,
      "latency_ms": 1240,
      "cached": false,
      "pii_entities_scrubbed": 2
    }

Loki (Kubernetes)
Add Promtail or the Grafana Alloy agent to your cluster and configure log labels:

    # promtail pipeline stage
    - match:
        selector: '{app="llm-proxy"}'
        stages:
          - json:
              expressions:
                model: model
                user_id: user_id
          - labels:
              model:
              user_id:

This enables log queries like {app="llm-proxy", model="gpt-4o"}.
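
To see what the pipeline above does to a single log line, the same extraction can be approximated in a few lines of Python. This is only an illustration of the `json` and `labels` stages, not Promtail's own logic; the `extract_labels` helper and the sample IDs are hypothetical.

```python
import json

# Fields promoted to labels, mirroring the `labels` stage of the pipeline.
LABEL_KEYS = ("model", "user_id")


def extract_labels(line, static_labels=None):
    """Return the label set a JSON log line would end up with.

    `static_labels` stands in for labels set elsewhere (e.g. app="llm-proxy").
    Lines that are not JSON objects keep only the static labels.
    """
    labels = dict(static_labels or {})
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return labels
    if not isinstance(record, dict):
        return labels
    for key in LABEL_KEYS:
        value = record.get(key)
        if value is not None:
            labels[key] = str(value)
    return labels


# Hypothetical log line shaped like the per-request log entry.
LINE = '{"level": "info", "user_id": "user_123", "model": "gpt-4o", "latency_ms": 1240}'

print(extract_labels(LINE, {"app": "llm-proxy"}))
```

Fields that are not promoted to labels, such as `latency_ms`, remain in the log line itself and can still be filtered at query time.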
Langfuse traces
For per-request prompt/completion tracing, see Analytics & observability.
Health endpoints
Used by Kubernetes probes:

| Endpoint | Purpose | Returns 200 when |
|---|---|---|
| GET /healthz | Liveness | App started |
| GET /readyz | Readiness | DB and ChromaDB reachable |
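
Outside Kubernetes, the same endpoints can be polled from a script, for example in a deployment smoke test. The sketch below uses only the Python standard library; the `http_status` and `check_health` helpers are hypothetical, and it assumes the health endpoints are served from the same host and port as the proxy.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for GET `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def check_health(base_url, fetch=http_status):
    """Report liveness and readiness based on the two health endpoints.

    `fetch` is injectable so the logic can be exercised without a network.
    """
    return {
        "live": fetch(base_url + "/healthz") == 200,
        "ready": fetch(base_url + "/readyz") == 200,
    }
```

Calling `check_health("http://proxy.internal:8000")` would return something like `{"live": True, "ready": False}` while the app is up but its database or ChromaDB is not yet reachable.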