# Observability

`prometheus.yml`:

```yaml
scrape_configs:
  - job_name: llm-proxy
    static_configs:
      - targets: ["proxy.internal:8000"]
    metrics_path: /metrics
    scrape_interval: 15s
```

## Kubernetes ServiceMonitor (Prometheus Operator)

```yaml
prometheus:
  serviceMonitor:
    enabled: true
    interval: "15s"
    scrapeTimeout: "10s"
    labels:
      release: prometheus # must match your Prometheus Operator's serviceMonitorSelector
```
| Metric | Type | Labels |
| --- | --- | --- |
| `relay_requests_total` | Counter | `model`, `status` |
| `relay_request_duration_seconds` | Histogram | `model` |
| `relay_tokens_total` | Counter | `model`, `type` (prompt/completion) |
| `relay_rate_limit_hits_total` | Counter | `limit_type` |
| `relay_cache_hits_total` | Counter | (none) |
| `relay_pii_entities_total` | Counter | `entity_type` |
| `relay_content_policy_blocks_total` | Counter | (none) |
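Each row above surfaces on the `/metrics` endpoint in the Prometheus text exposition format. As a quick illustration (this helper is a sketch for clarity, not part of the proxy), a labeled counter sample renders as one line of `name{labels} value`:

```python
def exposition_line(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    if labels:
        # label pairs are comma-separated, values double-quoted
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"

print(exposition_line("relay_requests_total", {"model": "gpt-4o", "status": "200"}, 42))
# relay_requests_total{model="gpt-4o",status="200"} 42
```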
Useful PromQL queries:

1. **Request rate** (requests/sec by model)

   ```promql
   sum by (model) (rate(relay_requests_total[5m]))
   ```

2. **Error rate** (non-200 responses)

   ```promql
   sum by (status) (rate(relay_requests_total{status!="200"}[5m]))
   ```

3. **Latency p50 / p95 / p99** (swap the quantile argument)

   ```promql
   histogram_quantile(0.95, sum by (le) (rate(relay_request_duration_seconds_bucket[5m])))
   ```

4. **Token throughput**

   ```promql
   sum by (type) (rate(relay_tokens_total[5m]))
   ```

5. **Rate limit hit rate**

   ```promql
   sum by (limit_type) (rate(relay_rate_limit_hits_total[5m]))
   ```

6. **Cache hit ratio** (`sum()` is required because the two metrics carry different label sets and would not match otherwise)

   ```promql
   sum(rate(relay_cache_hits_total[5m])) / sum(rate(relay_requests_total[5m]))
   ```

7. **PII entities scrubbed**

   ```promql
   sum by (entity_type) (rate(relay_pii_entities_total[5m]))
   ```
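The `histogram_quantile()` call above can be demystified with a small sketch: Prometheus finds the bucket containing the q-th observation and interpolates linearly inside it. This simplified version ignores counter resets and the special handling of the `+Inf` bucket, and the bucket data is invented for illustration:

```python
def histogram_quantile(q, buckets):
    """Approximate histogram_quantile(): linear interpolation within
    the bucket that contains the q-th observation.
    buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]          # last bucket holds the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # interpolate between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# cumulative counts for le="0.5", "1", "2.5", "5" second buckets
buckets = [(0.5, 40), (1.0, 70), (2.5, 90), (5.0, 100)]
print(histogram_quantile(0.95, buckets))  # 3.75: p95 falls in the 2.5-5s bucket
```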
Example alerting rules:

```yaml
groups:
  - name: llm-proxy
    rules:
      - alert: HighErrorRate
        expr: rate(relay_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High upstream error rate"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(relay_request_duration_seconds_bucket[5m]))) > 10
        for: 10m
        annotations:
          summary: "p95 latency over 10s"
      - alert: RateLimitSpike
        expr: sum(rate(relay_rate_limit_hits_total[5m])) > 5
        for: 5m
        annotations:
          summary: "Elevated rate limiting — check user quotas"
```
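To make the `rate(...) > 0.05` threshold concrete, this sketch (with invented sample data) shows what a per-second counter rate works out to over a 5-minute window; real `rate()` additionally handles counter resets and extrapolates to the window edges:

```python
def counter_rate(samples):
    """Per-second rate over a window of (timestamp, value) counter samples:
    increase over the window divided by its duration."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# 5xx counter sampled every 60s: 18 errors over a 300s window
samples = [(0, 100), (60, 104), (120, 108), (180, 112), (240, 115), (300, 118)]
rate = counter_rate(samples)
print(rate, rate > 0.05)  # 0.06 errors/sec -> HighErrorRate would fire
```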

Enable JSON logging for log aggregation (Loki, CloudWatch, Datadog):

```yaml
server:
  log_level: info
  # JSON format emitted automatically when LOG_FORMAT=json env var is set
```

Each request logs:

```json
{
  "timestamp": "2025-01-01T00:00:00Z",
  "level": "info",
  "request_id": "req_01j...",
  "user_id": "user_01j...",
  "team_id": "team_01j...",
  "model": "gpt-4o",
  "prompt_tokens": 142,
  "completion_tokens": 87,
  "latency_ms": 1240,
  "cached": false,
  "pii_entities_scrubbed": 2
}
```
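A log line of that shape can be produced with stdlib logging alone. This formatter is an illustrative sketch, not the proxy's actual implementation, and the `fields` record attribute is an assumption:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line (sketch)."""
    def format(self, record):
        payload = {"level": record.levelname.lower(), "message": record.getMessage()}
        # merge any structured fields attached to the record
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("llm-proxy")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("request completed",
            extra={"fields": {"model": "gpt-4o", "latency_ms": 1240}})
```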

Add Promtail or the Grafana Alloy agent to your cluster and configure log labels:

```yaml
# promtail pipeline stage
- match:
    selector: '{app="llm-proxy"}'
    stages:
      - json:
          expressions:
            model: model
            user_id: user_id
      - labels:
          model:
          user_id:
```

This enables log queries like `{app="llm-proxy", model="gpt-4o"}`.

For per-request prompt/completion tracing see Analytics & observability.

Used by Kubernetes probes:

| Endpoint | Purpose | Returns 200 when |
| --- | --- | --- |
| `GET /healthz` | Liveness | App started |
| `GET /readyz` | Readiness | DB and ChromaDB reachable |
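As a sketch of how such probe endpoints behave (the handler and the `check_dependencies` helper here are hypothetical stand-ins, not the proxy's code):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies():
    # stand-in for real reachability checks against the DB and ChromaDB
    return True

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.respond(200, b"ok")        # liveness: process is up
        elif self.path == "/readyz":
            ready = check_dependencies()    # readiness: dependencies reachable
            self.respond(200 if ready else 503, b"ok" if ready else b"not ready")
        else:
            self.respond(404, b"not found")

    def respond(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

# to run standalone:
# HTTPServer(("0.0.0.0", 8000), ProbeHandler).serve_forever()
```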