# Responses
Every inference request returns a response object that follows the OpenAI chat completions format. This page covers the response structure, streaming behavior, and error handling.
## Response object
A non-streaming response returns a single JSON object:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "phi-4-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}
```
### Key fields

| Field | Description |
|---|---|
| `choices[].message.content` | The model's response text |
| `choices[].finish_reason` | Why generation stopped: `stop`, `length`, `tool_calls`, or `content_filter` |
| `usage.prompt_tokens` | Tokens in the input |
| `usage.completion_tokens` | Tokens generated |
| `model` | The model that actually handled the request (relevant when routing is active) |
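For example, here is how these fields surface through the OpenAI Python SDK. This is a minimal sketch; the base URL and `edg_...` key placeholder mirror the cURL example later on this page:

```python
from openai import OpenAI

# Endpoint and key placeholder follow the cURL example on this page.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="edg_...")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)  # the response text
print(response.choices[0].finish_reason)    # e.g. "stop"
print(response.usage.total_tokens)          # prompt + completion tokens
print(response.model)                       # model that actually handled the request
```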
## Streaming

Set `stream: true` to receive tokens as they are generated. The response is a stream of server-sent events (SSE):
**Python**

```python
from openai import OpenAI

# Endpoint and key placeholder match the cURL example below.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="edg_...")

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```
**Node.js**

```javascript
import OpenAI from "openai";

// Endpoint and key placeholder match the cURL example below.
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "edg_..." });

const stream = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
```
**cURL**

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer edg_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"Explain quantum computing"}],"stream":true}'
```
Each SSE event contains a chunk:
data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{"content":"Quantum"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
The final chunk has `finish_reason: "stop"` and an empty `delta`. The stream ends with `data: [DONE]`.
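If you need the complete message as well as the incremental output, a common pattern is to accumulate the deltas as they arrive. A minimal sketch, reusing the `client` from the examples above:

```python
full_text = []
finish_reason = None

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # some servers emit chunks with no choices; skip them
    choice = chunk.choices[0]
    if choice.delta.content:
        full_text.append(choice.delta.content)
    if choice.finish_reason is not None:
        finish_reason = choice.finish_reason  # set on the final chunk

message = "".join(full_text)
```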
### Streaming metrics
When telemetry is enabled, Octomil automatically reports:
- Time to first token (TTFT) -- latency from request to first chunk
- Tokens per second -- generation throughput
- Total latency -- end-to-end request time
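These are reported server-side, but you can approximate the same numbers from the client. A rough sketch, treating one content chunk as roughly one token (an approximation):

```python
import time

start = time.perf_counter()
first_token_at = None
n_chunks = 0  # one content chunk is roughly one token

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        n_chunks += 1

total = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    gen_time = max(total - ttft, 1e-9)  # guard against division by zero
    print(f"TTFT {ttft:.3f}s | ~{n_chunks / gen_time:.1f} tok/s | total {total:.3f}s")
```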
## Error responses
Errors follow the OpenAI error format:
```json
{
  "error": {
    "message": "Model 'nonexistent-model' not found",
    "type": "invalid_request_error",
    "param": "model",
    "code": "model_not_found"
  }
}
```
| HTTP status | Error type | Common causes |
|---|---|---|
| 400 | `invalid_request_error` | Missing required fields, invalid parameters |
| 401 | `authentication_error` | Invalid or missing API key |
| 404 | `not_found` | Model not found or not deployed |
| 422 | `invalid_request_error` | Schema validation failure (structured output) |
| 429 | `rate_limit_error` | Too many requests (cloud fallback only) |
| 500 | `server_error` | Internal inference failure |
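With an OpenAI-compatible SDK, these statuses surface as typed exceptions. A minimal sketch using the exception classes from the `openai` Python package (the status-to-exception mapping is the SDK's standard behavior, not specific to this server):

```python
import openai

try:
    response = client.chat.completions.create(
        model="nonexistent-model",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.NotFoundError as e:         # 404: model not found or not deployed
    print(e.body)                         # the parsed "error" object shown above
except openai.AuthenticationError:        # 401: invalid or missing API key
    raise
except openai.RateLimitError:             # 429: cloud fallback rate limit
    raise
except openai.APIStatusError as e:        # any other non-2xx status
    print(e.status_code, e.message)
```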
## Finish reasons

| Reason | Meaning |
|---|---|
| `stop` | Model finished naturally or hit a stop sequence |
| `length` | Hit the `max_tokens` limit |
| `tool_calls` | Model is requesting a tool call -- see Tool calling |
| `content_filter` | Output was filtered by content policy |
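Code that consumes responses typically branches on this field. A sketch; the `dispatch` helper is hypothetical, and the real tool-call flow is covered on the Tool calling page:

```python
choice = response.choices[0]

if choice.finish_reason == "length":
    # Truncated at max_tokens; raise the limit or continue in a follow-up turn.
    print("warning: response was truncated")
elif choice.finish_reason == "tool_calls":
    for call in choice.message.tool_calls:
        dispatch(call)  # hypothetical helper -- see the Tool calling page
elif choice.finish_reason == "content_filter":
    print("output withheld by content policy")
else:  # "stop": the model finished naturally
    print(choice.message.content)
```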
## Related

- Control -- parameters that shape the response
- Streaming Inference -- detailed streaming guide
- Tool calling -- handling the `tool_calls` finish reason