The gateway’s main endpoint. Accepts OpenAI-format chat requests, routes to any of the 13 supported providers, streams responses back, and writes a trace to ClickHouse on the way out.
POST /v1/chat/completions HTTP/1.1
Host: localhost:8080
Content-Type: application/json

Request body

{
  "model": "openai/gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Hello" }
  ],
  "temperature": 0.7,
  "max_tokens": 200,
  "stream": false
}
| Field | Type | Description |
| --- | --- | --- |
| `model` | string (required) | `provider/model` (e.g. `openai/gpt-4o`), a bare name, or `"auto"` for semantic routing. |
| `messages` | array (required) | OpenAI-format messages. `role` ∈ `user` \| `assistant` \| `system` \| `tool`. |
| `temperature` | float | 0.0–2.0. If omitted, the provider default is used. |
| `max_tokens` | int | Output token cap. |
| `top_p` | float | Nucleus sampling. |
| `stream` | bool | `true` → Server-Sent Events. See Streaming. |
| `stop` | string \| array | Stop sequence(s). |
| `tools` | array | OpenAI-format tool definitions. The engine translates them to provider-native shapes. |
| `tool_choice` | `"auto"` \| `"none"` \| `"required"` \| object | Force a specific tool or let the model pick. |
Any OpenAI field not listed above is passed through to the provider untouched.

Response body (non-streaming)

{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "created": 1713465600,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello!" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 2,
    "total_tokens": 10
  },
  "cost": {
    "input_cost_usd": 0.0000012,
    "output_cost_usd": 0.0000012,
    "total_cost_usd": 0.0000024
  }
}
The cost object is an OpenTracy extra; the rest matches OpenAI exactly.
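The three `cost` figures are additive, so a client can sanity-check them before logging or billing. A minimal TypeScript sketch — the `Cost` interface and helper are our own, derived from the example above, not part of any SDK:

```typescript
// Shape of the OpenTracy `cost` extension (field names from the example
// response above; the interface itself is ours).
interface Cost {
  input_cost_usd: number;
  output_cost_usd: number;
  total_cost_usd: number;
}

// Check that the reported total matches the sum of its parts, tolerating
// floating-point rounding.
function costIsConsistent(cost: Cost, epsilon = 1e-12): boolean {
  const sum = cost.input_cost_usd + cost.output_cost_usd;
  return Math.abs(sum - cost.total_cost_usd) < epsilon;
}
```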

Response headers

| Header | Example | Meaning |
| --- | --- | --- |
| `X-OpenTracy-Selected-Model` | `gpt-4o-mini` | Which concrete model answered. |
| `X-OpenTracy-Cluster-ID` | `84` | Semantic cluster assigned to the prompt (0–99). |
| `X-OpenTracy-Expected-Error` | `0.08` | Predicted error rate for the selected model. |
| `X-OpenTracy-Routing-Ms` | `1.3` | Time spent on the routing decision. |
| `X-OpenTracy-Session-Id` | `sess_af91` | For multi-turn tool calls; echo it back on the next call. |
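Clients calling the gateway with plain `fetch` can lift these headers into a typed object. A hedged sketch — the `RoutingInfo` shape and helper are ours, not part of any SDK; pass in any getter such as `(n) => resp.headers.get(n)`:

```typescript
// Routing metadata returned on every response (header names from the
// table above; this interface is our own).
interface RoutingInfo {
  selectedModel: string | null;
  clusterId: number | null;
  expectedError: number | null;
  routingMs: number | null;
  sessionId: string | null;
}

// Takes a header getter so it works with fetch's Headers, Node's
// IncomingMessage, or a plain object in tests.
function readRoutingInfo(get: (name: string) => string | null): RoutingInfo {
  const num = (name: string) => {
    const v = get(name);
    return v === null ? null : Number(v);
  };
  return {
    selectedModel: get("X-OpenTracy-Selected-Model"),
    clusterId: num("X-OpenTracy-Cluster-ID"),
    expectedError: num("X-OpenTracy-Expected-Error"),
    routingMs: num("X-OpenTracy-Routing-Ms"),
    sessionId: get("X-OpenTracy-Session-Id"),
  };
}
```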

Curl

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say hello in three words."}]
  }'

TypeScript / Node (openai SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "any", // engine holds provider keys
});

const resp = await client.chat.completions.create({
  model: "openai/gpt-4o-mini",
  messages: [{ role: "user", content: "Hello" }],
});

console.log(resp.choices[0].message.content);

Go (net/http)

body := []byte(`{
  "model": "openai/gpt-4o-mini",
  "messages": [{"role": "user", "content": "Hello"}]
}`)
req, err := http.NewRequest(http.MethodPost,
    "http://localhost:8080/v1/chat/completions", bytes.NewReader(body))
if err != nil {
    log.Fatal(err)
}
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()

Semantic auto-routing

Pass "model": "auto" and the engine picks a model per prompt, based on its learned cluster/error profiles:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Prove √2 is irrational."}]
  }'
The response headers show you which model was picked:
X-OpenTracy-Selected-Model: gpt-4o
X-OpenTracy-Cluster-ID: 47
X-OpenTracy-Expected-Error: 0.01
See /v1/route if you want the decision without generating a completion.

Streaming

Set "stream": true. Responses come back as Server-Sent Events:
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "count to 5"}],
    "stream": true
  }'
data: {"id":"chatcmpl-xyz","choices":[{"delta":{"content":"1"},"index":0}]}
data: {"id":"chatcmpl-xyz","choices":[{"delta":{"content":", 2"},"index":0}]}
...
data: [DONE]
The engine translates Anthropic and Bedrock event streams into OpenAI’s SSE format, so clients don’t need per-provider logic.
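The stream above can be reassembled client-side by splitting on lines, JSON-parsing each `data:` payload, and stopping at the `[DONE]` sentinel. A minimal sketch assuming the chunk shape shown above (network wiring deliberately omitted):

```typescript
// Minimal shape of one SSE chunk, per the stream example above.
interface StreamChunk {
  id: string;
  choices: { delta: { content?: string }; index: number }[];
}

// Accumulate assistant text from already-split SSE lines.
function collectStreamText(lines: string[]): string {
  let text = "";
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue; // skip blanks and comments
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const chunk: StreamChunk = JSON.parse(payload);
    for (const choice of chunk.choices) {
      text += choice.delta.content ?? "";
    }
  }
  return text;
}
```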

Tool calls

Pass OpenAI-format tools. The engine maps them to provider-native shapes (Anthropic tools, Gemini function declarations, etc.):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "parameters": {"type":"object","properties":{"city":{"type":"string"}}}
      }
    }]
  }'
The response comes back with tool_calls in the assistant message.
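A typical client loop then parses `tool_calls`, runs each function locally, and sends the results back as `role: "tool"` messages. A sketch under that assumption — the handler map and its contents are illustrative, not part of the gateway; note that `arguments` arrives as a JSON string, not an object:

```typescript
// OpenAI-format tool call as it appears in the assistant message.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

// Run each call through a local handler and produce the `role: "tool"`
// messages to send on the next request.
function runToolCalls(
  calls: ToolCall[],
  handlers: Record<string, (args: any) => string>,
): { tool_call_id: string; role: "tool"; content: string }[] {
  return calls.map((call) => ({
    tool_call_id: call.id,
    role: "tool" as const,
    content: handlers[call.function.name](JSON.parse(call.function.arguments)),
  }));
}
```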

Errors

| Status | `error.code` | Meaning |
| --- | --- | --- |
| 400 | `invalid_request` | Malformed body or missing required field. |
| 401 | `unauthorized` | Bearer token missing or invalid (if auth is enabled). |
| 404 | `model_not_found` | Unknown model string. |
| 429 | `rate_limit` | The upstream provider rate-limited the request. |
| 500 | `provider_error` | The provider returned an error; the body echoes its message. |
| 504 | `timeout` | The upstream took longer than the configured timeout. |
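Of these, 429 and 504 are the transient ones; a client can retry them with backoff and treat the rest as terminal. A small illustrative policy — the attempt cap and delays are our own choices, not gateway defaults:

```typescript
// Decide whether (and how long) to wait before retrying, based on the
// status codes in the table above. Returns null when the request should
// not be retried.
function retryDelayMs(status: number, attempt: number): number | null {
  const retryable = status === 429 || status === 504; // transient only
  if (!retryable || attempt >= 3) return null; // terminal, or out of tries
  return 250 * 2 ** attempt; // 250ms, 500ms, 1000ms
}
```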