# Responses
Every inference request returns a response object that follows the OpenAI chat completions format. This page covers the response structure, streaming behavior, and error handling.
## Response object
A non-streaming response returns a single JSON object:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "phi-4-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}
```
### Key fields

| Field | Description |
|---|---|
| `choices[].message.content` | The model's response text |
| `choices[].finish_reason` | Why generation stopped: `stop`, `length`, `tool_calls`, or `content_filter` |
| `usage.prompt_tokens` | Tokens in the input |
| `usage.completion_tokens` | Tokens generated |
| `model` | The model that actually handled the request (relevant when routing is active) |
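For example, here is how these fields surface through the OpenAI Python SDK. This is a minimal sketch; the base URL and `edg_...` key placeholder mirror the cURL example later on this page:

```python
from openai import OpenAI

# Endpoint and key placeholder follow the cURL example on this page.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="edg_...")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)  # the response text
print(response.choices[0].finish_reason)    # e.g. "stop"
print(response.usage.total_tokens)          # prompt + completion tokens
print(response.model)                       # model that actually handled the request
```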
## Streaming

Set `stream: true` to receive tokens as they are generated. The response is a stream of server-sent events (SSE):
**Python**

```python
from openai import OpenAI

# Endpoint and key placeholder match the cURL example below.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="edg_...")

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```
**Node.js**

```javascript
import OpenAI from "openai";

// Endpoint and key placeholder match the cURL example below.
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "edg_..." });

const stream = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
```
**cURL**

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer edg_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"Explain quantum computing"}],"stream":true}'
```
Each SSE event contains a chunk:
data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{"content":"Quantum"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
The final chunk has `finish_reason: "stop"` and an empty `delta`. The stream ends with `data: [DONE]`.
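If you need the complete message as well as the incremental output, a common pattern is to accumulate the deltas as they arrive. A minimal sketch, reusing the `client` from the examples above:

```python
full_text = []
finish_reason = None

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # some servers emit chunks with no choices; skip them
    choice = chunk.choices[0]
    if choice.delta.content:
        full_text.append(choice.delta.content)
    if choice.finish_reason is not None:
        finish_reason = choice.finish_reason  # set on the final chunk

message = "".join(full_text)
```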
### Streaming metrics
When telemetry is enabled, Octomil automatically reports:
- Time to first token (TTFT) -- latency from request to first chunk
- Tokens per second -- generation throughput
- Total latency -- end-to-end request time
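These are reported server-side, but you can approximate the same numbers from the client. A rough sketch, treating one content chunk as roughly one token (an approximation):

```python
import time

start = time.perf_counter()
first_token_at = None
n_chunks = 0  # one content chunk is roughly one token

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        n_chunks += 1

total = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    gen_time = max(total - ttft, 1e-9)  # guard against division by zero
    print(f"TTFT {ttft:.3f}s | ~{n_chunks / gen_time:.1f} tok/s | total {total:.3f}s")
```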
## Error responses
Errors follow the OpenAI error format:
```json
{
  "error": {
    "message": "Model 'nonexistent-model' not found",
    "type": "invalid_request_error",
    "param": "model",
    "code": "model_not_found"
  }
}
```
| HTTP status | Error type | Common causes |
|---|---|---|
| 400 | `invalid_request_error` | Missing required fields, invalid parameters |
| 401 | `authentication_error` | Invalid or missing API key |
| 404 | `not_found` | Model not found or not deployed |
| 422 | `invalid_request_error` | Schema validation failure (structured output) |
| 429 | `rate_limit_error` | Too many requests (cloud fallback only) |
| 500 | `server_error` | Internal inference failure |
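With an OpenAI-compatible SDK, these statuses surface as typed exceptions. A minimal sketch using the exception classes from the `openai` Python package (the status-to-exception mapping is the SDK's standard behavior, not specific to this server):

```python
import openai

try:
    response = client.chat.completions.create(
        model="nonexistent-model",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.NotFoundError as e:         # 404: model not found or not deployed
    print(e.body)                         # the parsed "error" object shown above
except openai.AuthenticationError:        # 401: invalid or missing API key
    raise
except openai.RateLimitError:             # 429: cloud fallback rate limit
    raise
except openai.APIStatusError as e:        # any other non-2xx status
    print(e.status_code, e.message)
```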
## Finish reasons

| Reason | Meaning |
|---|---|
| `stop` | Model finished naturally or hit a stop sequence |
| `length` | Hit the `max_tokens` limit |
| `tool_calls` | Model is requesting a tool call -- see Tool calling |
| `content_filter` | Output was filtered by content policy |
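Code that consumes responses typically branches on this field. A sketch; the `dispatch` helper is hypothetical, and the real tool-call flow is covered on the Tool calling page:

```python
choice = response.choices[0]

if choice.finish_reason == "length":
    # Truncated at max_tokens; raise the limit or continue in a follow-up turn.
    print("warning: response was truncated")
elif choice.finish_reason == "tool_calls":
    for call in choice.message.tool_calls:
        dispatch(call)  # hypothetical helper -- see the Tool calling page
elif choice.finish_reason == "content_filter":
    print("output withheld by content policy")
else:  # "stop": the model finished naturally
    print(choice.message.content)
```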
## Related

- Control -- parameters that shape the response
- Streaming Inference -- detailed streaming guide
- Tool calling -- handling the `tool_calls` finish reason