# Cloud Inference
Octomil provides a unified inference API that proxies requests to an LLM backend: Ollama (the default) or any OpenAI-compatible endpoint. This gives you a single API surface for both on-device and cloud inference, with automatic fallback between backends.
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/inference` | POST | Single-shot inference |
| `/api/v1/inference/stream` | POST | Streaming inference (SSE) |
| `/api/v1/inference/batch` | POST | Batch inference |
| `/api/v1/embeddings` | POST | Text embeddings |
| `/api/v1/inference/health` | GET | Backend health check |
## Single-Shot Inference
```bash
curl -X POST http://localhost:8000/api/v1/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "messages": [{"role": "user", "content": "Explain gradient descent"}],
    "parameters": {"temperature": 0.7, "max_tokens": 256}
  }'
```
Response:

```json
{
  "output": "Gradient descent is an optimization algorithm...",
  "latency_ms": 1234,
  "provider": "ollama",
  "model_id": "phi-4-mini"
}
```
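The same request can be made from Python with only the standard library. A minimal sketch, assuming the gateway runs at `http://localhost:8000` and you already hold a bearer token; `build_request` and `infer` are illustrative helpers, not part of any Octomil SDK:

```python
import json
import urllib.request

def build_request(model_id, prompt, temperature=0.7, max_tokens=256):
    # Mirror the request body documented above: model_id, messages, parameters.
    return {
        "model_id": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "parameters": {"temperature": temperature, "max_tokens": max_tokens},
    }

def infer(base_url, token, body, timeout=30):
    # POST the body to /api/v1/inference and decode the JSON response.
    req = urllib.request.Request(
        f"{base_url}/api/v1/inference",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

# result = infer("http://localhost:8000", token,
#                build_request("phi-4-mini", "Explain gradient descent"))
# result["output"], result["latency_ms"], result["provider"]
```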
## Streaming Inference
The streaming endpoint returns Server-Sent Events (SSE):
```bash
curl -N -X POST http://localhost:8000/api/v1/inference/stream \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "messages": [{"role": "user", "content": "Write a haiku"}]
  }'
```
Each SSE event contains:
{"token": "Cherry", "done": false, "provider": "ollama"}
{"token": " blossoms", "done": false, "provider": "ollama"}
{"token": "", "done": true, "provider": "ollama"}
See Streaming Inference for SDK-specific streaming patterns.
## Batch Inference
Process multiple inputs in a single request with configurable concurrency:
```bash
curl -X POST http://localhost:8000/api/v1/inference/batch \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "inputs": [
      {"messages": [{"role": "user", "content": "Hello"}]},
      {"messages": [{"role": "user", "content": "Goodbye"}]}
    ],
    "max_concurrency": 4
  }'
```
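`max_concurrency` caps how many inputs are processed in parallel. Conceptually the behavior matches this asyncio sketch (an illustration of the semantics, not the actual server implementation):

```python
import asyncio

async def run_batch(inputs, infer_one, max_concurrency=4):
    # Run infer_one over every input, with at most max_concurrency requests
    # in flight at once; results come back in input order.
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(inp):
        async with sem:
            return await infer_one(inp)

    return await asyncio.gather(*(guarded(i) for i in inputs))
```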
## Configuration
The LLM backend is configured via environment variables:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OPENAI_API_ENDPOINT` | — | OpenAI-compatible endpoint (fallback) |
| `OPENAI_API_KEY` | — | API key for the OpenAI-compatible backend |
| `LLM_BACKEND_TIMEOUT` | `30` | Request timeout in seconds |
Octomil tries Ollama first; if it is unreachable, requests fall back to the OpenAI-compatible endpoint (when one is configured).
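The selection order can be sketched as a small routine; a minimal illustration using the environment variables from the table above, assuming Ollama's standard `GET /api/tags` endpoint as a liveness probe (`pick_backend` is a hypothetical helper):

```python
import os
import urllib.request

def pick_backend(env=os.environ):
    # Try Ollama first; fall back to an OpenAI-compatible endpoint if configured.
    ollama = env.get("OLLAMA_BASE_URL", "http://localhost:11434")
    try:
        urllib.request.urlopen(f"{ollama}/api/tags", timeout=2)
        return ("ollama", ollama)
    except OSError:
        pass  # Ollama unreachable; try the fallback
    endpoint = env.get("OPENAI_API_ENDPOINT")
    if endpoint:
        return ("openai-compatible", endpoint)
    raise RuntimeError("no LLM backend reachable")
```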
## SDK Integration
Every SDK can call these endpoints directly:
- Python: `client.predict()`, `client.stream_predict()`, `client.embed()` — see Python SDK
- Node: `client.predict()`, `client.streamPredict()` — see Node SDK
- iOS: `EmbeddingClient`, `RoutingClient.cloudInfer()` — see iOS SDK
- Android: `EmbeddingClient`, `RoutingClient.cloudInfer()` — see Android SDK
- Browser: `embed()`, `RoutingClient.cloudInfer()` — see Browser SDK