
Inference

POST /v1/chat/completions

Generate a response from a model with a provided prompt and conversation history. This is the primary endpoint for interacting with models served by octomil serve.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `model` | string | Yes | Model name (e.g. `phi-4-mini`, `gemma-1b`). Use `auto` for automatic routing. |
| `messages` | array | Yes | Conversation history. Each message has `role` (`system`, `user`, `assistant`) and `content`. |
| `stream` | boolean | No | Stream the response. Default: `false`. |
| `temperature` | number | No | Sampling temperature (0.0-2.0). Higher values are more random. Default: `0.7`. |
| `top_p` | number | No | Nucleus sampling threshold. Default: `0.9`. |
| `max_tokens` | number | No | Maximum tokens to generate. Default: model-dependent. |
| `stop` | string or array | No | Stop sequence(s). Generation stops when any sequence is produced. |
| `response_format` | object | No | Force output format. See Structured Decoding. |
| `n` | integer | No | Number of completions to generate. Default: `1`. |
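As a quick sketch, the optional parameters above can be assembled into a request body like this (defaults mirror the table; the helper name is illustrative, not part of any SDK):

```python
def chat_body(model, messages, *, stream=False, temperature=0.7,
              top_p=0.9, max_tokens=None, stop=None, n=1):
    """Assemble a /v1/chat/completions request body.

    max_tokens and stop are omitted when left at None so the server's
    own (model-dependent) defaults apply.
    """
    body = {
        "model": model,
        "messages": messages,
        "stream": stream,
        "temperature": temperature,
        "top_p": top_p,
        "n": n,
    }
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if stop is not None:
        body["stop"] = stop
    return body
```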

Request

import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
)
print(response.choices[0].message.content)

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1709000000,
  "model": "phi-4-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The sky appears blue because of Rayleigh scattering..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 87,
    "total_tokens": 111
  }
}
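Once the response JSON is decoded into a dict, the fields above can be read with a couple of small helpers (the names are illustrative; the `"length"` finish reason follows the usual OpenAI-API convention for truncated output):

```python
def completion_text(resp: dict) -> str:
    """Pull the first choice's message content from a decoded response."""
    return resp["choices"][0]["message"]["content"]

def was_truncated(resp: dict) -> bool:
    """True when generation stopped because max_tokens was reached
    (finish_reason == "length", per OpenAI-API convention)."""
    return resp["choices"][0]["finish_reason"] == "length"
```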

Streaming

Set stream: true to receive server-sent events as tokens are generated.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'

Each streamed chunk is a JSON object of the form:

{"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709000000,"model":"phi-4-mini","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}

The final chunk has finish_reason: "stop" and an empty delta.
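A client can also consume the stream without the openai SDK by parsing the event lines itself. A minimal sketch, assuming each event arrives as a standard `data: `-prefixed SSE line and the stream ends with a `data: [DONE]` sentinel (the sentinel is an OpenAI-API convention; verify that octomil emits it):

```python
import json

def delta_content(line: str):
    """Return the content delta carried by one SSE line, or None.

    None covers blank keep-alive lines, the [DONE] sentinel, and the
    final chunk, whose delta is empty.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Usage against a live server (requires the `requests` package):
# with requests.post(url, json=body, stream=True) as r:
#     for raw in r.iter_lines(decode_unicode=True):
#         piece = delta_content(raw)
#         if piece:
#             print(piece, end="", flush=True)
```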

JSON Mode

Force the model to output valid JSON:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "List 3 programming languages as JSON."}],
    "response_format": {"type": "json_object"}
  }'

See Structured Decoding for JSON Schema enforcement.
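Because `json_object` mode guarantees syntactically valid JSON, the reply can be decoded directly. A sketch, assuming the response has already been parsed into a dict of the shape shown earlier:

```python
import json

def decode_json_reply(resp: dict):
    """Decode the assistant's JSON-mode reply into Python objects."""
    content = resp["choices"][0]["message"]["content"]
    return json.loads(content)
```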

Response Headers

When optimizations are active, response headers provide telemetry:

| Header | Description |
| --- | --- |
| `X-Octomil-Routed-Model` | Which model handled the request (when using `auto` routing) |
| `X-Octomil-Speculative` | Whether speculative decoding was used |
| `X-Octomil-Early-Exit-Tokens` | Number of tokens that exited early |
| `X-Octomil-Compression` | Whether prompt compression was applied |
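These can be collected in one pass over the response headers. A sketch using only the header names from the table (the helper name is illustrative):

```python
def octomil_telemetry(headers) -> dict:
    """Pick out any X-Octomil-* telemetry headers from a response."""
    return {k: v for k, v in headers.items()
            if k.lower().startswith("x-octomil-")}

# With the `requests` package, pass resp.headers directly:
# octomil_telemetry(requests.post(url, json=body).headers)
```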

Errors

| Status | Error | Description |
| --- | --- | --- |
| 400 | `bad_request` | Invalid or missing request fields |
| 401 | `unauthorized` | Missing or invalid API key |
| 404 | `not_found` | Resource does not exist |
| 409 | `conflict` | Resource already exists or state conflict |
| 429 | `rate_limited` | Too many requests; check `Retry-After` |
| 500 | `internal_error` | Unexpected server error |
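Of these, only 429 and 5xx responses are worth retrying. A minimal backoff sketch that honors `Retry-After` when present (assumes a numeric, seconds-valued header; HTTP also permits a date form, which this sketch does not handle):

```python
def retry_delay(status: int, headers: dict, default: float = 1.0):
    """Seconds to wait before retrying, or None when not retryable.

    429 honors a numeric Retry-After header when present; 5xx falls
    back to a fixed default; other statuses (client errors) are not
    retried.
    """
    if status == 429:
        return float(headers.get("Retry-After", default))
    if status >= 500:
        return default
    return None
```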