Streaming Inference

Stream LLM responses token-by-token as they're generated. Octomil's streaming uses Server-Sent Events (SSE) from the server and provides idiomatic wrappers in every SDK.

How It Works

  1. Client sends a POST to /api/v1/inference/stream
  2. Server proxies to the LLM backend (Ollama or OpenAI-compatible)
  3. Tokens stream back as SSE events: {"token": "...", "done": false}
  4. Final event has "done": true
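The four steps above can be sketched as a direct call to the SSE endpoint using only the standard library. The endpoint path and event shape come from this page; the request-body fields and the `Authorization` header name are assumptions, not confirmed API details:

```python
import json
import urllib.request

def parse_sse_line(line: bytes):
    """Return the decoded event dict for a 'data:' SSE line, else None."""
    line = line.strip()
    if not line.startswith(b"data: "):
        return None
    return json.loads(line[len(b"data: "):])

def stream_tokens(base_url, api_key, model, messages):
    """Yield token strings from the streaming endpoint until done=true.

    Illustrative only: the JSON body shape and auth header are assumptions.
    """
    req = urllib.request.Request(
        f"{base_url}/api/v1/inference/stream",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            "Accept": "text/event-stream",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            event = parse_sse_line(raw)
            if event is None:
                continue  # skip comments, keep-alives, blank lines
            if event.get("token"):
                yield event["token"]
            if event.get("done"):
                break
```

The SDK wrappers shown below handle this framing for you; the sketch is only meant to show what travels over the wire.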

SDK Examples

import asyncio

import octomil

client = octomil.Client(api_key="oct_...")

# Synchronous streaming
for token in client.stream_predict("phi-4-mini", messages=[
    {"role": "user", "content": "Explain transformers"}
]):
    print(token.token, end="", flush=True)

# Async streaming (must run inside a coroutine)
async def main():
    async for token in client.stream_predict_async("phi-4-mini", messages=[
        {"role": "user", "content": "Explain transformers"}
    ]):
        print(token.token, end="", flush=True)

asyncio.run(main())

StreamToken Fields

Field      Type      Description
token      string    The generated text fragment
done       boolean   true on the final event
provider   string    Backend that generated the token (ollama or openai)

Performance Metrics

The Python and iOS SDKs track streaming performance automatically:

  • TTFC (Time to First Chunk) — latency before the first token arrives
  • Throughput — chunks per second
  • Average chunk latency — mean time between chunks
  • Total duration — end-to-end streaming time

These metrics are reported to the Octomil telemetry system for monitoring.