Streaming Inference
Stream LLM responses token-by-token as they're generated. Octomil's streaming uses Server-Sent Events (SSE) from the server and provides idiomatic wrappers in every SDK.
How It Works
- Client sends a POST to `/api/v1/inference/stream`
- Server proxies the request to the LLM backend (Ollama or an OpenAI-compatible API)
- Tokens stream back as SSE events: `{"token": "...", "done": false}`
- The final event has `"done": true`
SDK Examples
- Python
- Node.js
- iOS (Swift)
- Android (Kotlin)
- Browser
```python
import octomil

client = octomil.Client(api_key="oct_...")

# Synchronous streaming
for token in client.stream_predict("phi-4-mini", messages=[
    {"role": "user", "content": "Explain transformers"}
]):
    print(token.token, end="", flush=True)

# Async streaming (run inside an async function)
async for token in client.stream_predict_async("phi-4-mini", messages=[
    {"role": "user", "content": "Explain transformers"}
]):
    print(token.token, end="", flush=True)
```
```javascript
import { OctomilClient } from '@octomil/node';

const client = new OctomilClient({ apiKey: 'oct_...' });

for await (const token of client.streamPredict('phi-4-mini', {
  messages: [{ role: 'user', content: 'Explain transformers' }],
})) {
  process.stdout.write(token.token);
}
```
```swift
let engine = MyStreamingEngine()
let wrapper = InstrumentedStreamWrapper(engine: engine)

let stream = wrapper.stream(model: model, input: input)
for try await chunk in stream {
    print(String(data: chunk.data, encoding: .utf8) ?? "", terminator: "")
}

let metrics = wrapper.result()
print("TTFC: \(metrics.ttfcMs)ms, throughput: \(metrics.throughputChunksPerSec) chunks/s")
```
```kotlin
val engine = StreamingInferenceEngine { model, input ->
    // Return Flow<InferenceChunk>
    flow { emit(InferenceChunk(index = 0, data = bytes, modality = Modality.TEXT)) }
}

engine.stream(model, input).collect { chunk ->
    print(String(chunk.data))
}
```
```javascript
import { Octomil } from '@octomil/browser';

const client = new Octomil({ serverUrl: 'http://localhost:8000', apiKey: 'oct_...' });

for await (const chunk of client.chatStream('phi-4-mini', [
  { role: 'user', content: 'Explain transformers' }
])) {
  document.getElementById('output').textContent += chunk;
}
```
StreamToken Fields
| Field | Type | Description |
|---|---|---|
| `token` | string | The generated text fragment |
| `done` | boolean | `true` on the final event |
| `provider` | string | Backend that generated the token (`ollama`, `openai`) |
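As a sketch of how these fields compose into a full response, the `StreamToken` class and `accumulate` helper below are illustrative stand-ins, not the SDK's actual types:

```python
from dataclasses import dataclass

# Illustrative record matching the field table above (not the SDK's type):
@dataclass
class StreamToken:
    token: str
    done: bool
    provider: str

def accumulate(tokens):
    """Join token fragments until the final (done=True) event."""
    parts = []
    for t in tokens:
        if t.done:
            break
        parts.append(t.token)
    return "".join(parts)

stream = [
    StreamToken("Transformers ", False, "ollama"),
    StreamToken("use attention.", False, "ollama"),
    StreamToken("", True, "ollama"),
]
print(accumulate(stream))  # Transformers use attention.
```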
Performance Metrics
The Python and iOS SDKs track streaming performance automatically:
- TTFC (Time to First Chunk) — latency before the first token arrives
- Throughput — chunks per second
- Average chunk latency — mean time between chunks
- Total duration — end-to-end streaming time
These metrics are reported to the Octomil telemetry system for monitoring.
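To illustrate how the four metrics relate, they can all be derived from the stream start time and per-chunk arrival timestamps. The `stream_metrics` function below is a sketch under that assumption, not the SDKs' actual telemetry code:

```python
def stream_metrics(start: float, arrivals: list[float]) -> dict:
    """Derive streaming metrics from a start time and chunk arrival times (seconds)."""
    total_s = arrivals[-1] - start
    # Inter-chunk gaps: time between consecutive arrivals
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return {
        "ttfc_ms": (arrivals[0] - start) * 1000,            # time to first chunk
        "throughput_chunks_per_sec": len(arrivals) / total_s,
        "avg_chunk_latency_ms": (sum(gaps) / len(gaps)) * 1000 if gaps else 0.0,
        "total_duration_ms": total_s * 1000,
    }

# Four chunks arriving at 0.5s intervals:
m = stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
print(m["ttfc_ms"])  # 500.0
```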