Responses

Every inference request returns a response object that follows the OpenAI chat completions format. This page covers the response structure, streaming behavior, and error handling.

Response object

A non-streaming response returns a single JSON object:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "phi-4-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}

Key fields

Field                        Description
choices[].message.content   The model's response text
choices[].finish_reason     Why generation stopped: stop, length, tool_calls, or content_filter
usage.prompt_tokens         Tokens in the input
usage.completion_tokens     Tokens generated
model                       The model that actually handled the request (relevant when routing is active)
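
For example, reading these fields with the OpenAI Python SDK. This is a minimal sketch; the base URL and API key below are placeholders for your own endpoint:

from openai import OpenAI

# An OpenAI-compatible client pointed at your endpoint (placeholder values).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)  # response text
print(response.choices[0].finish_reason)    # why generation stopped
print(response.usage.total_tokens)          # prompt + completion tokens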

Streaming

Set stream: true to receive tokens as they are generated. The response is a stream of server-sent events (SSE):

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

Each SSE event contains a chunk:

data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{"content":"Quantum"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

The final chunk has finish_reason: "stop" and an empty delta. The stream ends with data: [DONE].
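
If you need the full text or the finish reason after streaming, accumulate the deltas as they arrive. A minimal sketch, reusing the client from above:

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

parts = []
finish_reason = None
for chunk in stream:
    choice = chunk.choices[0]
    if choice.delta.content:
        parts.append(choice.delta.content)
    if choice.finish_reason is not None:
        finish_reason = choice.finish_reason  # set on the final chunk

full_text = "".join(parts)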

Streaming metrics

When telemetry is enabled, Octomil automatically reports:

  • Time to first token (TTFT) -- latency from request to first chunk
  • Tokens per second -- generation throughput
  • Total latency -- end-to-end request time
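
The same quantities can be approximated client-side. An illustrative sketch, reusing the client from above (counting content chunks only approximates token counts):

import time

start = time.perf_counter()
first_token_at = None
chunk_count = 0

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first content chunk arrived
        chunk_count += 1

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.3f}s")
print(f"Throughput: {chunk_count / (end - first_token_at):.1f} chunks/s")
print(f"Total latency: {end - start:.3f}s")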

Error responses

Errors follow the OpenAI error format:

{
  "error": {
    "message": "Model 'nonexistent-model' not found",
    "type": "invalid_request_error",
    "param": "model",
    "code": "model_not_found"
  }
}

HTTP status   Error type              Common causes
400           invalid_request_error   Missing required fields, invalid parameters
401           authentication_error    Invalid or missing API key
404           not_found               Model not found or not deployed
422           invalid_request_error   Schema validation failure (structured output)
429           rate_limit_error        Too many requests (cloud fallback only)
500           server_error            Internal inference failure
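
With the OpenAI Python SDK, these statuses surface as typed exceptions. A sketch of handling them, reusing the client from above:

import openai

try:
    response = client.chat.completions.create(
        model="nonexistent-model",
        messages=[{"role": "user", "content": "Hello"}],
    )
except openai.NotFoundError as e:      # 404: model not found or not deployed
    print(e.message)
except openai.RateLimitError:          # 429: back off and retry
    pass
except openai.APIStatusError as e:     # any other non-2xx response
    print(e.status_code, e.message)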

Finish reasons

Reason           Meaning
stop             Model finished naturally or hit a stop sequence
length           Hit the max_tokens limit
tool_calls       Model is requesting a tool call -- see Tool calling
content_filter   Output was filtered by content policy
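
In code, finish_reason is the field to branch on. A sketch (handle_tool_calls is a placeholder for your own handler):

finish = response.choices[0].finish_reason

if finish == "length":
    print("Output truncated; consider raising max_tokens")
elif finish == "tool_calls":
    handle_tool_calls(response.choices[0].message.tool_calls)  # placeholder handler
elif finish == "content_filter":
    print("Output was filtered by content policy")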