RAG integration
RAG (Retrieval-Augmented Generation) runs at stage 06. The proxy automatically enriches requests with relevant context from your knowledge base — your application code doesn’t need to change.
How it works
Section titled “How it works”- The last user message is embedded with
all-MiniLM-L6-v2(sentence-transformers, ~80 MB) - ChromaDB is queried for the top-k chunks above the score threshold
- Retrieved chunks are prepended to the system message before the LLM call
The prompt sent to the LLM becomes:
[system]Relevant context:---<chunk 1 text>---<chunk 2 text>---
<original system message, if any>
[user]<original user message>Configuration
Section titled “Configuration”rag: enabled: true top_k: 5 score_threshold: 0.4 embedding_model: all-MiniLM-L6-v2Ingesting documents
Section titled “Ingesting documents”From a directory
Section titled “From a directory”curl -X POST http://localhost:8000/internal/kb/ingest-directory \ -H "Authorization: Bearer $PROXY_MASTER_KEY" \ -H "Content-Type: application/json" \ -d '{"directory": "knowledge_base"}'The directory path is relative to the proxy’s working directory. Supported formats: .txt, .md, .rst.
Response:
{ "ingested_files": 12, "total_chunks": 348, "files": ["docs/api.md", "docs/guide.md", ...]}Single file upload
Section titled “Single file upload”curl -X POST http://localhost:8000/internal/kb/upload \ -H "Authorization: Bearer $PROXY_MASTER_KEY" \ -F "file=@./runbook.md"CLI script
Section titled “CLI script”python scripts/ingest_kb.py --directory ./docsStorage
Section titled “Storage”ChromaDB persists to disk at chroma_data/ (configurable). In Kubernetes this maps to a PVC:
persistence: chroma: size: 10Gi storageClass: "" # cluster default accessMode: ReadWriteOnceTuning retrieval
Section titled “Tuning retrieval”| Parameter | Effect |
|---|---|
top_k: 3 | Fewer chunks → less context noise, lower cost |
top_k: 10 | More context, but may hit max_input_tokens |
score_threshold: 0.6 | Higher = stricter matching, fewer false positives |
score_threshold: 0.2 | Very broad matching — useful for short queries |
Disabling per-request
Section titled “Disabling per-request”There is no per-request override — RAG is either on or off globally. To disable for a specific use case, deploy a separate proxy instance with rag.enabled: false.