Scaling & HA
Single-replica (default)
The default replicaCount: 1 with config.workers: 4 provides concurrency via multiple uvicorn worker processes. This is the recommended starting point.
For most teams, a single replica with a Redis backend for rate limiting and caching is sufficient up to several hundred requests per minute.
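A minimal values.yaml sketch of this starting point is shown below. The nesting is inferred from the dotted key names used on this page (config.workers, redis.enabled), so check it against the chart's own values.yaml before relying on it.

```yaml
# Single-replica starting point (nesting assumed from the dotted key names)
replicaCount: 1
config:
  workers: 4      # uvicorn worker processes per pod
redis:
  enabled: true   # Redis backend for rate limiting and caching
```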
Multi-replica
RWX storage options
| Cloud | Solution |
|---|---|
| AWS | Amazon EFS (with EFS CSI driver) |
| Azure | Azure Files |
| GCP | Filestore (NFS) |
| On-prem | NFS, Ceph CephFS |
persistence:
  chroma:
    accessMode: ReadWriteMany
    storageClass: efs-sc   # your RWX storage class
  knowledgeBase:
    accessMode: ReadWriteMany
    storageClass: efs-sc

Redis required for shared state
With multiple replicas, the rate limiter and cache must use Redis. Otherwise each pod enforces its own limits and keeps its own cache, so a client whose requests are spread across three pods can make up to three times the configured number of requests:
redis:
  enabled: true
replicaCount: 3

HPA (Horizontal Pod Autoscaler)
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 8
  targetCPUUtilizationPercentage: 70

Scaling is driven by CPU utilisation. Memory-based scaling is less useful here because the spaCy model is loaded once at startup and contributes a constant baseline (~800 MB).
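As a worked example of how a CPU target turns into a replica count, the standard Kubernetes HPA algorithm scales the current replica count by the ratio of observed to target utilisation, then clamps the result to the configured bounds. This sketch assumes the chart renders an ordinary HorizontalPodAutoscaler from the autoscaling values; the 140% utilisation figure is illustrative, not a measurement.

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentUtilisation}}{\text{targetUtilisation}} \right\rceil
  = \left\lceil 2 \times \frac{140}{70} \right\rceil = 4,
\qquad 2 \le 4 \le 8 .
\]
```

Utilisation is measured as a percentage of each pod's CPU request, so CPU-based autoscaling only works when the pods declare a CPU request.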
Memory sizing
The spaCy en_core_web_lg model occupies roughly 800 MB of memory once loaded at startup. The default resource requests account for this:
resources:
  requests:
    memory: 1500Mi   # 800 MB spaCy + headroom for requests
  limits:
    memory: 3Gi      # headroom for concurrent request processing

Do not reduce requests.memory below 1Gi. With a smaller request, the scheduler may place the pod on a node without enough free memory, and the pod can be OOMKilled or evicted during model load.
PostgreSQL
The bundled Bitnami PostgreSQL subchart deploys a single primary instance. For production HA:
- Set postgresql.enabled: false
- Provision an external HA PostgreSQL (RDS Multi-AZ, Cloud SQL, etc.)
- Set externalDatabase.url: postgresql+asyncpg://user:pass@host:5432/llm_proxy
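Taken together, the two values changes from the steps above might look like the following fragment. The nesting is inferred from the dotted key names, and user, pass and host are the placeholders from the example URL, to be replaced with the details of the external database.

```yaml
postgresql:
  enabled: false   # do not deploy the bundled Bitnami subchart
externalDatabase:
  # placeholder credentials and host; substitute real values
  url: postgresql+asyncpg://user:pass@host:5432/llm_proxy
```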
Pod disruption budget
To keep at least one pod serving traffic during node drains and other voluntary disruptions (for example, during cluster upgrades) with replicaCount >= 2:
# pdb.yaml: apply separately
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-proxy-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: llm-proxy
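The replicaCount >= 2 condition follows from how Kubernetes evaluates a minAvailable budget: an eviction is permitted only while the number of healthy pods would stay at or above minAvailable. A short worked calculation, using replica counts that appear on this page:

```latex
\[
\text{allowedDisruptions} = \max(0,\ \text{healthyPods} - \text{minAvailable})
\]
\[
\text{replicaCount} = 1:\ \max(0,\ 1 - 1) = 0
\qquad
\text{replicaCount} = 3:\ \max(0,\ 3 - 1) = 2
\]
```

With one replica the budget allows no evictions, so a node drain blocks until the budget is relaxed or removed. With three healthy replicas, up to two pods may be evicted at once.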