Inference
POST /v1/chat/completions
Generate a response from a model with a provided prompt and conversation history. This is the primary endpoint for interacting with models served by octomil serve.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (e.g. phi-4-mini, gemma-1b). Use auto for automatic routing. |
| messages | array | Yes | Conversation history. Each message has role (system, user, assistant) and content. |
| stream | boolean | No | Stream the response. Default: false. |
| temperature | number | No | Sampling temperature (0.0-2.0). Higher = more random. Default: 0.7. |
| top_p | number | No | Nucleus sampling threshold. Default: 0.9. |
| max_tokens | number | No | Maximum number of tokens to generate. Default: model-dependent. |
| stop | string or array | No | Stop sequence(s). Generation stops when any sequence is produced. |
| response_format | object | No | Force the output format. See Structured Decoding. |
| n | integer | No | Number of completions to generate. Default: 1. |
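To show how the optional parameters fit together, here is a sketch of a request body combining them. The specific values are illustrative only, not recommended defaults:

```python
import json

# Hypothetical request body combining the optional parameters above.
payload = {
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Name three planets."}],
    "temperature": 0.2,   # lower temperature -> more deterministic sampling
    "top_p": 0.9,         # nucleus sampling threshold
    "max_tokens": 64,     # hard cap on generated tokens
    "stop": ["\n\n"],     # stop at the first blank line
    "n": 1,               # a single completion
}
print(json.dumps(payload))
```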
Request
- Python
- JavaScript
- iOS (Swift)
- Android (Kotlin)
- cURL
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
)
print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });

const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Why is the sky blue?" },
  ],
});
console.log(response.choices[0].message.content);
```
```swift
import Foundation

let url = URL(string: "http://localhost:8080/v1/chat/completions")!
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")

let body: [String: Any] = [
    "model": "phi-4-mini",
    "messages": [
        ["role": "system", "content": "You are a helpful assistant."],
        ["role": "user", "content": "Why is the sky blue?"]
    ]
]
request.httpBody = try! JSONSerialization.data(withJSONObject: body)

URLSession.shared.dataTask(with: request) { data, _, error in
    guard error == nil, let data else { return }
    print(String(data: data, encoding: .utf8) ?? "")
}.resume()
```
```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

val client = OkHttpClient()

val bodyJson = JSONObject()
    .put("model", "phi-4-mini")
    .put(
        "messages",
        JSONArray()
            .put(JSONObject().put("role", "system").put("content", "You are a helpful assistant."))
            .put(JSONObject().put("role", "user").put("content", "Why is the sky blue?"))
    )

val req = Request.Builder()
    .url("http://localhost:8080/v1/chat/completions")
    .post(bodyJson.toString().toRequestBody("application/json".toMediaType()))
    .build()

client.newCall(req).execute().use { resp ->
    println(resp.body?.string())
}
```
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Why is the sky blue?"}
    ]
  }'
```
Response
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1709000000,
  "model": "phi-4-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The sky appears blue because of Rayleigh scattering..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 87,
    "total_tokens": 111
  }
}
```
Streaming
Set stream: true to receive server-sent events as tokens are generated.
- cURL
- Python
- JavaScript
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```
```python
stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
```javascript
const stream = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Count to 5" }],
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
Each streamed chunk has the following shape:

```json
{"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709000000,"model":"phi-4-mini","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}
```
The final chunk has finish_reason: "stop" and an empty delta.
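Clients reassemble the full reply by concatenating delta.content across chunks. A minimal sketch, using hard-coded sample chunks that mimic the shape above (no live server assumed):

```python
# Sample chunks mirroring the streamed shape above (hard-coded for illustration).
chunks = [
    {"choices": [{"delta": {"content": "1"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": ", 2"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},  # final chunk: empty delta
]

text = ""
for chunk in chunks:
    # .get() handles the final chunk, whose delta carries no content
    text += chunk["choices"][0]["delta"].get("content", "")
print(text)  # -> 1, 2
```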
JSON Mode
Force the model to output valid JSON:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "List 3 programming languages as JSON."}],
    "response_format": {"type": "json_object"}
  }'
```
See Structured Decoding for JSON Schema enforcement.
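Because JSON mode guarantees the content field is valid JSON, it can be handed straight to a parser. A sketch using a hard-coded sample completion (the actual content will vary):

```python
import json

# Hard-coded sample of what a JSON-mode completion's content might look like.
content = '{"languages": ["Python", "Rust", "Go"]}'

data = json.loads(content)  # safe: JSON mode guarantees parseable output
print(data["languages"])  # -> ['Python', 'Rust', 'Go']
```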
Response Headers
When optimizations are active, response headers provide telemetry:
| Header | Description |
|---|---|
| X-Octomil-Routed-Model | Which model handled the request (when using auto routing) |
| X-Octomil-Speculative | Whether speculative decoding was used |
| X-Octomil-Early-Exit-Tokens | Number of tokens that exited early |
| X-Octomil-Compression | Whether prompt compression was applied |
Errors
| Status | Error | Description |
|---|---|---|
| 400 | bad_request | Invalid or missing request fields |
| 401 | unauthorized | Missing or invalid API key |
| 404 | not_found | Resource does not exist |
| 409 | conflict | Resource already exists or state conflict |
| 429 | rate_limited | Too many requests; check the Retry-After header |
| 500 | internal_error | Unexpected server error |
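A client retrying on 429 would typically read Retry-After before backing off. A hypothetical helper (the header name comes from the table above; parsing it as a plain seconds value is an assumption, since HTTP also permits a date form):

```python
def retry_after_seconds(headers, default=1.0):
    """Return the Retry-After delay in seconds, falling back to `default`.

    Hypothetical helper: assumes the server sends Retry-After as a number
    of seconds; the HTTP-date form is ignored in this sketch.
    """
    value = headers.get("Retry-After")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

print(retry_after_seconds({"Retry-After": "2"}))  # -> 2.0
print(retry_after_seconds({}))                    # -> 1.0
```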