POST /v1/chat/completions
Fully compatible with the OpenAI Chat Completions API. Any client or SDK built for OpenAI works without code changes; only the `base_url` needs updating.
Request
```http
POST /v1/chat/completions
Authorization: Bearer <api-key>
Content-Type: application/json
```

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Model ID. Must be in `allowedModels` or a defined alias. |
| `messages` | array | yes | Array of `{role, content}` objects |
| `stream` | bool | no | `false` (default). Set `true` for SSE streaming. |
| `temperature` | float | no | Sampling temperature (0–2) |
| `max_tokens` | int | no | Maximum output tokens. Capped by `perModelMaxTokens` if set. |
| `tools` | array | no | OpenAI tool/function definitions |
| `tool_choice` | string/object | no | `auto`, `none`, or a specific tool |
| `top_p` | float | no | Nucleus sampling |
| `frequency_penalty` | float | no | |
| `presence_penalty` | float | no | |
| `user` | string | no | Passed through to provider |
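For environments without an OpenAI SDK, the request described above can be assembled by hand. The following is a minimal sketch using only the Python standard library. The helper name `build_chat_request` is hypothetical, the host `https://proxy.internal` and the key `llmp_...` are placeholders, and the request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, **optional):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    # Required fields from the table above.
    body = {"model": model, "messages": messages}
    # Optional fields (stream, temperature, max_tokens, ...) are included
    # only when the caller sets them, so server-side defaults apply otherwise.
    body.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "https://proxy.internal",  # placeholder host
    "llmp_...",                # placeholder API key
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is RAG?"}],
    max_tokens=512,
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Leaving unset optional fields out of the body, rather than sending `null`, is a design choice of this sketch so that the server's own defaults apply.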
Example
```json
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"}
  ],
  "max_tokens": 512
}
```

Response (non-streaming)

Standard OpenAI ChatCompletion object:

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "RAG stands for Retrieval-Augmented Generation..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 87,
    "total_tokens": 129
  }
}
```

Streaming
Set `stream: true` to receive Server-Sent Events:

```
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"RAG"},"index":0}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":" stands"},"index":0}]}
data: [DONE]
```

Python (OpenAI SDK)
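For clients that cannot use the SDK, the event stream shown in the Streaming section can be decoded by hand. This is a minimal sketch, assuming each event arrives as one `data:` line carrying either a JSON chunk or the `[DONE]` sentinel; `iter_stream_content` is a hypothetical helper name, and the sample lines are the ones from the Streaming section.

```python
import json


def iter_stream_content(lines):
    """Yield content fragments from the SSE lines of a chunked completion."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


sample = [
    'data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"RAG"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":" stands"},"index":0}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_stream_content(sample))
```

The OpenAI SDK performs this decoding itself when `stream=True` is passed, as in the SDK code that follows.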
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.internal/v1",
    api_key="llmp_...",
)

# Non-streaming
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

# Streaming
for chunk in client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```

Model routing
The `model` field is processed as follows:

- Check `model_aliases`: rewrite if a match is found
- Validate against `allowed_models`: return 400 if not allowed
- Check `per_model_max_tokens`: cap `max_tokens` if set
- Route to the appropriate provider via LiteLLM
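The first three steps can be sketched as a pure function over the request body. This is an illustration under stated assumptions, not the proxy's actual code. `resolve_model` and `ModelNotAllowed` are hypothetical names, and the alias and limit values are made up. The sketch assumes that the allow-list check runs on the name after alias rewriting and that a request without `max_tokens` is left unchanged; this page specifies neither.

```python
class ModelNotAllowed(Exception):
    """Hypothetical error type; the page only says the proxy returns 400."""

    status_code = 400


def resolve_model(body, model_aliases, allowed_models, per_model_max_tokens):
    """Apply alias rewriting, validation, and token capping to a request body."""
    resolved = dict(body)

    # Step 1: rewrite the model name if it matches an alias.
    model = model_aliases.get(body["model"], body["model"])

    # Step 2: reject models outside the allow list.
    # Assumption: the check runs on the name after alias rewriting.
    if model not in allowed_models:
        raise ModelNotAllowed(f"model {model!r} is not allowed")
    resolved["model"] = model

    # Step 3: cap max_tokens when a per-model limit exists.
    # Assumption: a request that omits max_tokens is left unchanged.
    cap = per_model_max_tokens.get(model)
    if cap is not None and "max_tokens" in resolved:
        resolved["max_tokens"] = min(resolved["max_tokens"], cap)

    # Step 4 (not shown): the resolved body is handed to LiteLLM for routing.
    return resolved


resolved = resolve_model(
    {"model": "fast", "max_tokens": 4096},
    model_aliases={"fast": "gpt-4o"},       # illustrative alias
    allowed_models={"gpt-4o"},
    per_model_max_tokens={"gpt-4o": 1024},  # illustrative limit
)
```

Here the alias `fast` resolves to `gpt-4o`, and the requested 4096 tokens are capped at the illustrative limit of 1024.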
To route an Anthropic model through this endpoint:
```json
{"model": "claude-3-5-sonnet-20241022", ...}
```

LiteLLM detects the provider from the model name prefix automatically.