Cloud Inference

Octomil provides a unified inference API that proxies requests to an LLM backend -- Ollama (default) or any OpenAI-compatible endpoint. This gives you a single API surface for both on-device and cloud inference, with automatic fallback.

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/v1/inference` | POST | Single-shot inference |
| `/api/v1/inference/stream` | POST | Streaming inference (SSE) |
| `/api/v1/inference/batch` | POST | Batch inference |
| `/api/v1/embeddings` | POST | Text embeddings |
| `/api/v1/inference/health` | GET | Backend health check |

Single-Shot Inference

```bash
curl -X POST http://localhost:8000/api/v1/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "messages": [{"role": "user", "content": "Explain gradient descent"}],
    "parameters": {"temperature": 0.7, "max_tokens": 256}
  }'
```

Response:

```json
{
  "output": "Gradient descent is an optimization algorithm...",
  "latency_ms": 1234,
  "provider": "ollama",
  "model_id": "phi-4-mini"
}
```
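
The same request can be issued from Python without an SDK. This is a minimal sketch using only the standard library; the base URL, token, and field names follow the curl example above, and error handling is reduced to the essentials:

```python
import json
import urllib.request

def build_inference_payload(model_id, prompt, temperature=0.7, max_tokens=256):
    """Assemble the JSON body expected by POST /api/v1/inference."""
    return {
        "model_id": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "parameters": {"temperature": temperature, "max_tokens": max_tokens},
    }

def infer(base_url, token, payload):
    """Send a single-shot inference request and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/inference",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())
```

`infer("http://localhost:8000", token, build_inference_payload("phi-4-mini", "Explain gradient descent"))` returns the same `output`/`latency_ms`/`provider`/`model_id` object shown above.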

Streaming Inference

The streaming endpoint returns Server-Sent Events (SSE):

```bash
curl -N -X POST http://localhost:8000/api/v1/inference/stream \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "messages": [{"role": "user", "content": "Write a haiku"}]
  }'
```

Each SSE event contains:

```json
{"token": "Cherry", "done": false, "provider": "ollama"}
{"token": " blossoms", "done": false, "provider": "ollama"}
{"token": "", "done": true, "provider": "ollama"}
```
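
Consuming the stream amounts to reading lines, stripping any standard `data:` SSE framing, and concatenating tokens until an event with `done: true` arrives. A small parser sketch (the `data:` handling is an assumption; the events above are shown unframed):

```python
import json

def parse_sse_event(line):
    """Parse one line of the stream into an event dict, or None for blanks.

    Handles both a bare JSON payload and the standard "data: {...}" framing.
    """
    line = line.strip()
    if not line:
        return None
    if line.startswith("data:"):
        line = line[len("data:"):].strip()
    return json.loads(line)

def collect_tokens(lines):
    """Concatenate streamed tokens until the terminal done=true event."""
    pieces = []
    for raw in lines:
        event = parse_sse_event(raw)
        if event is None:
            continue
        pieces.append(event["token"])
        if event["done"]:
            break
    return "".join(pieces)
```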

See Streaming Inference for SDK-specific streaming patterns.

Batch Inference

Process multiple inputs in a single request with configurable concurrency:

```bash
curl -X POST http://localhost:8000/api/v1/inference/batch \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "inputs": [
      {"messages": [{"role": "user", "content": "Hello"}]},
      {"messages": [{"role": "user", "content": "Goodbye"}]}
    ],
    "max_concurrency": 4
  }'
```
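
The batch body follows the same shape as single-shot inference, with `inputs` holding one message list per item. A small helper for assembling it (a sketch; field names are taken from the request above):

```python
def build_batch_payload(model_id, prompts, max_concurrency=4):
    """Build the JSON body for POST /api/v1/inference/batch.

    Each prompt becomes one entry in "inputs"; max_concurrency caps how
    many items the server processes in parallel.
    """
    return {
        "model_id": model_id,
        "inputs": [
            {"messages": [{"role": "user", "content": prompt}]}
            for prompt in prompts
        ],
        "max_concurrency": max_concurrency,
    }
```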

Configuration

The LLM backend is configured via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OPENAI_API_ENDPOINT` | (none) | OpenAI-compatible endpoint (fallback) |
| `OPENAI_API_KEY` | (none) | API key for the OpenAI-compatible backend |
| `LLM_BACKEND_TIMEOUT` | `30` | Request timeout in seconds |

Octomil tries Ollama first. If unavailable, it falls back to the OpenAI-compatible endpoint.
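
That fallback order can be summarized as a small decision function (illustrative only; the actual selection logic lives inside Octomil and also reacts to runtime errors):

```python
def choose_backend(ollama_reachable, openai_endpoint_set):
    """Pick a backend in the documented order: Ollama first, then fallback."""
    if ollama_reachable:
        return "ollama"
    if openai_endpoint_set:
        return "openai-compatible"
    raise RuntimeError("no LLM backend available")
```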

SDK Integration

Every SDK can call these endpoints directly:

  • Python: client.predict(), client.stream_predict(), client.embed() — see Python SDK
  • Node: client.predict(), client.streamPredict() — see Node SDK
  • iOS: EmbeddingClient, RoutingClient.cloudInfer() — see iOS SDK
  • Android: EmbeddingClient, RoutingClient.cloudInfer() — see Android SDK
  • Browser: embed(), RoutingClient.cloudInfer() — see Browser SDK
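
Without an SDK, the embeddings endpoint can also be called over plain HTTP. The request schema is not documented on this page, so the body below is a hypothetical shape modeled on the batch endpoint (`model_id` plus a list of `inputs`); check your server's actual contract before relying on it:

```python
import json
import urllib.request

def build_embeddings_payload(model_id, texts):
    """Hypothetical request body for POST /api/v1/embeddings (assumed field names)."""
    return {"model_id": model_id, "inputs": list(texts)}

def embed(base_url, token, model_id, texts):
    """Request embeddings for a list of texts and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/embeddings",
        data=json.dumps(build_embeddings_payload(model_id, texts)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```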