Cloud Inference

Octomil provides a unified inference API that proxies requests to an LLM backend -- Ollama (default) or any OpenAI-compatible endpoint. This gives you a single API surface for both on-device and cloud inference, with automatic fallback.

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/v1/inference` | POST | Single-shot inference |
| `/api/v1/inference/stream` | POST | Streaming inference (SSE) |
| `/api/v1/inference/batch` | POST | Batch inference |
| `/api/v1/embeddings` | POST | Text embeddings |
| `/api/v1/inference/health` | GET | Backend health check |

Single-Shot Inference

```bash
curl -X POST http://localhost:8000/api/v1/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "messages": [{"role": "user", "content": "Explain gradient descent"}],
    "parameters": {"temperature": 0.7, "max_tokens": 256}
  }'
```

Response:

```json
{
  "output": "Gradient descent is an optimization algorithm...",
  "latency_ms": 1234,
  "provider": "ollama",
  "model_id": "phi-4-mini"
}
```
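
The same request can be issued from Python without an SDK. This is a minimal sketch using only the standard library; the base URL, token, and field names follow the curl example above, and error handling is reduced to the essentials:

```python
import json
import urllib.request

def build_inference_payload(model_id, prompt, temperature=0.7, max_tokens=256):
    """Assemble the JSON body expected by POST /api/v1/inference."""
    return {
        "model_id": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "parameters": {"temperature": temperature, "max_tokens": max_tokens},
    }

def infer(base_url, token, payload):
    """Send a single-shot inference request and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/inference",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())
```

`infer("http://localhost:8000", token, build_inference_payload("phi-4-mini", "Explain gradient descent"))` returns the same `output`/`latency_ms`/`provider`/`model_id` object shown above.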

Streaming Inference

The streaming endpoint returns Server-Sent Events (SSE):

```bash
curl -N -X POST http://localhost:8000/api/v1/inference/stream \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "messages": [{"role": "user", "content": "Write a haiku"}]
  }'
```

Each SSE event contains:

```json
{"token": "Cherry", "done": false, "provider": "ollama"}
{"token": " blossoms", "done": false, "provider": "ollama"}
{"token": "", "done": true, "provider": "ollama"}
```
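
Consuming the stream amounts to reading lines, stripping any standard `data:` SSE framing, and concatenating tokens until an event with `done: true` arrives. A small parser sketch (the `data:` handling is an assumption; the events above are shown unframed):

```python
import json

def parse_sse_event(line):
    """Parse one line of the stream into an event dict, or None for blanks.

    Handles both a bare JSON payload and the standard "data: {...}" framing.
    """
    line = line.strip()
    if not line:
        return None
    if line.startswith("data:"):
        line = line[len("data:"):].strip()
    return json.loads(line)

def collect_tokens(lines):
    """Concatenate streamed tokens until the terminal done=true event."""
    pieces = []
    for raw in lines:
        event = parse_sse_event(raw)
        if event is None:
            continue
        pieces.append(event["token"])
        if event["done"]:
            break
    return "".join(pieces)
```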

See Streaming Inference for SDK-specific streaming patterns.

Batch Inference

Process multiple inputs in a single request with configurable concurrency:

```bash
curl -X POST http://localhost:8000/api/v1/inference/batch \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "inputs": [
      {"messages": [{"role": "user", "content": "Hello"}]},
      {"messages": [{"role": "user", "content": "Goodbye"}]}
    ],
    "max_concurrency": 4
  }'
```
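
The batch body follows the same shape as single-shot inference, with `inputs` holding one message list per item. A small helper for assembling it (a sketch; field names are taken from the request above):

```python
def build_batch_payload(model_id, prompts, max_concurrency=4):
    """Build the JSON body for POST /api/v1/inference/batch.

    Each prompt becomes one entry in "inputs"; max_concurrency caps how
    many items the server processes in parallel.
    """
    return {
        "model_id": model_id,
        "inputs": [
            {"messages": [{"role": "user", "content": prompt}]}
            for prompt in prompts
        ],
        "max_concurrency": max_concurrency,
    }
```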

Configuration

The LLM backend is configured via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OPENAI_API_ENDPOINT` | (none) | OpenAI-compatible endpoint (fallback) |
| `OPENAI_API_KEY` | (none) | API key for the OpenAI-compatible backend |
| `LLM_BACKEND_TIMEOUT` | `30` | Request timeout in seconds |

Octomil tries Ollama first. If unavailable, it falls back to the OpenAI-compatible endpoint.
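
That fallback order can be summarized as a small decision function (illustrative only; the actual selection logic lives inside Octomil and also reacts to runtime errors):

```python
def choose_backend(ollama_reachable, openai_endpoint_set):
    """Pick a backend in the documented order: Ollama first, then fallback."""
    if ollama_reachable:
        return "ollama"
    if openai_endpoint_set:
        return "openai-compatible"
    raise RuntimeError("no LLM backend available")
```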

SDK Integration

Every SDK can call these endpoints directly:

  • Python: client.predict(), client.stream_predict(), client.embed() — see Python SDK
  • Node: client.predict(), client.streamPredict() — see Node SDK
  • iOS: EmbeddingClient, RoutingClient.cloudInfer() — see iOS SDK
  • Android: EmbeddingClient, RoutingClient.cloudInfer() — see Android SDK
  • Browser: embed(), RoutingClient.cloudInfer() — see Browser SDK
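
Without an SDK, the embeddings endpoint can also be called over plain HTTP. The request schema is not documented on this page, so the body below is a hypothetical shape modeled on the batch endpoint (`model_id` plus a list of `inputs`); check your server's actual contract before relying on it:

```python
import json
import urllib.request

def build_embeddings_payload(model_id, texts):
    """Hypothetical request body for POST /api/v1/embeddings (assumed field names)."""
    return {"model_id": model_id, "inputs": list(texts)}

def embed(base_url, token, model_id, texts):
    """Request embeddings for a list of texts and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/embeddings",
        data=json.dumps(build_embeddings_payload(model_id, texts)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```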