Inference
POST /v1/chat/completions
Generate a response from a model with a provided prompt and conversation history. This is the primary endpoint for interacting with models served by octomil serve.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (e.g. phi-4-mini, gemma-1b). Use auto for automatic routing. |
| messages | array | Yes | Conversation history. Each message has role (system, user, assistant) and content. |
| stream | boolean | No | Stream the response. Default: false. |
| temperature | number | No | Sampling temperature (0.0-2.0). Higher = more random. Default: 0.7. |
| top_p | number | No | Nucleus sampling threshold. Default: 0.9. |
| max_tokens | number | No | Maximum number of tokens to generate. Default: model-dependent. |
| stop | string or array | No | Stop sequence(s). Generation stops when any sequence is produced. |
| response_format | object | No | Force the output format. See Structured Decoding. |
| n | integer | No | Number of completions to generate. Default: 1. |
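To show how the optional parameters fit together, here is a sketch of a request body combining them. The specific values are illustrative only, not recommended defaults:

```python
import json

# Hypothetical request body combining the optional parameters above.
payload = {
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Name three planets."}],
    "temperature": 0.2,   # lower temperature -> more deterministic sampling
    "top_p": 0.9,         # nucleus sampling threshold
    "max_tokens": 64,     # hard cap on generated tokens
    "stop": ["\n\n"],     # stop at the first blank line
    "n": 1,               # a single completion
}
print(json.dumps(payload))
```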
Request
- Python
- JavaScript
- iOS (Swift)
- Android (Kotlin)
- cURL
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
)
print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });

const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Why is the sky blue?" },
  ],
});
console.log(response.choices[0].message.content);
```
```swift
import Foundation

let url = URL(string: "http://localhost:8080/v1/chat/completions")!
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")

let body: [String: Any] = [
    "model": "phi-4-mini",
    "messages": [
        ["role": "system", "content": "You are a helpful assistant."],
        ["role": "user", "content": "Why is the sky blue?"]
    ]
]
request.httpBody = try! JSONSerialization.data(withJSONObject: body)

URLSession.shared.dataTask(with: request) { data, _, error in
    guard error == nil, let data else { return }
    print(String(data: data, encoding: .utf8) ?? "")
}.resume()
```
```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

val client = OkHttpClient()

val bodyJson = JSONObject()
    .put("model", "phi-4-mini")
    .put(
        "messages",
        JSONArray()
            .put(JSONObject().put("role", "system").put("content", "You are a helpful assistant."))
            .put(JSONObject().put("role", "user").put("content", "Why is the sky blue?"))
    )

val req = Request.Builder()
    .url("http://localhost:8080/v1/chat/completions")
    .post(bodyJson.toString().toRequestBody("application/json".toMediaType()))
    .build()

client.newCall(req).execute().use { resp ->
    println(resp.body?.string())
}
```
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Why is the sky blue?"}
    ]
  }'
```
Response
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1709000000,
  "model": "phi-4-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The sky appears blue because of Rayleigh scattering..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 87,
    "total_tokens": 111
  }
}
```
Streaming
Set stream: true to receive server-sent events as tokens are generated.
- cURL
- Python
- JavaScript
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```
```python
stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
```javascript
const stream = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Count to 5" }],
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
Each streamed chunk has the following shape:

```json
{"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1709000000,"model":"phi-4-mini","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}
```
The final chunk has finish_reason: "stop" and an empty delta.
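Clients reassemble the full reply by concatenating delta.content across chunks. A minimal sketch, using hard-coded sample chunks that mimic the shape above (no live server assumed):

```python
# Sample chunks mirroring the streamed shape above (hard-coded for illustration).
chunks = [
    {"choices": [{"delta": {"content": "1"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": ", 2"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},  # final chunk: empty delta
]

text = ""
for chunk in chunks:
    # .get() handles the final chunk, whose delta carries no content
    text += chunk["choices"][0]["delta"].get("content", "")
print(text)  # -> 1, 2
```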
JSON Mode
Force the model to output valid JSON:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "List 3 programming languages as JSON."}],
    "response_format": {"type": "json_object"}
  }'
```
See Structured Decoding for JSON Schema enforcement.
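Because JSON mode guarantees the content field is valid JSON, it can be handed straight to a parser. A sketch using a hard-coded sample completion (the actual content will vary):

```python
import json

# Hard-coded sample of what a JSON-mode completion's content might look like.
content = '{"languages": ["Python", "Rust", "Go"]}'

data = json.loads(content)  # safe: JSON mode guarantees parseable output
print(data["languages"])  # -> ['Python', 'Rust', 'Go']
```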
Response Headers
When optimizations are active, response headers provide telemetry:
| Header | Description |
|---|---|
| X-Octomil-Routed-Model | Which model handled the request (when using auto routing) |
| X-Octomil-Speculative | Whether speculative decoding was used |
| X-Octomil-Early-Exit-Tokens | Number of tokens that exited early |
| X-Octomil-Compression | Whether prompt compression was applied |
Errors
| Status | Error | Description |
|---|---|---|
| 400 | bad_request | Invalid or missing request fields |
| 401 | unauthorized | Missing or invalid API key |
| 404 | not_found | Resource does not exist |
| 409 | conflict | Resource already exists or state conflict |
| 429 | rate_limited | Too many requests; check the Retry-After header |
| 500 | internal_error | Unexpected server error |
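A client retrying on 429 would typically read Retry-After before backing off. A hypothetical helper (the header name comes from the table above; parsing it as a plain seconds value is an assumption, since HTTP also permits a date form):

```python
def retry_after_seconds(headers, default=1.0):
    """Return the Retry-After delay in seconds, falling back to `default`.

    Hypothetical helper: assumes the server sends Retry-After as a number
    of seconds; the HTTP-date form is ignored in this sketch.
    """
    value = headers.get("Retry-After")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

print(retry_after_seconds({"Retry-After": "2"}))  # -> 2.0
print(retry_after_seconds({}))                    # -> 1.0
```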