
Capabilities

Octomil provides a full inference stack that runs on-device. Here is what you can do with it.

Local inference

Run any supported model locally with a single command:

octomil serve phi-4-mini

The server exposes an OpenAI-compatible API at http://localhost:8080/v1 and supports chat completions, embeddings, and streaming.
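
Because the API is OpenAI-compatible, any OpenAI client can talk to the local server. A minimal chat-completion call with the official Python client might look like this (the api_key value is a placeholder; a local server typically does not check it):

from openai import OpenAI

# Point the standard OpenAI client at the local Octomil server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
)
print(response.choices[0].message.content)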

See: Octomil Serve

Structured decoding

Guarantee valid JSON output by enforcing schemas at the token level. No retries, no post-processing -- every response is valid on the first attempt.
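
Assuming the server accepts the OpenAI-style response_format parameter with a JSON schema (an assumption; the Structured Decoding page documents the exact interface), a schema-constrained request could look like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Illustrative schema; every field below is an example, not part of the API.
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Extract the city and temperature from: 'It is 18C in Oslo.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "weather_report",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "temperature_c": {"type": "number"},
                },
                "required": ["city", "temperature_c"],
            },
        },
    },
)
print(response.choices[0].message.content)  # valid JSON matching the schema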

See: Structured Decoding

Speculative decoding

Use a small draft model to accelerate generation from a larger model. The draft model proposes tokens in parallel; the main model verifies them. Throughput improves 2-3x with no quality loss.
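
For intuition, here is a minimal greedy sketch of the propose-then-verify loop. It is illustrative only, not Octomil's implementation; draft_next and target_next are assumed helpers that return each model's greedy next token for a given prefix:

def speculative_step(prefix, draft_next, target_next, k=4):
    # Draft model proposes k tokens sequentially (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # Target model verifies each proposal; in a real system this check is a
    # single batched forward pass over all k positions, which is the speedup.
    accepted = []
    ctx = list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if expected == token:
            accepted.append(token)
            ctx.append(token)
        else:
            # First mismatch: keep the target model's token, discard the rest.
            accepted.append(expected)
            break
    else:
        # Every proposal matched, so the target model adds one bonus token.
        accepted.append(target_next(ctx))
    return accepted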

See: Speculative Decoding

Streaming

Receive tokens as they are generated to power real-time UIs. Streams use the standard SSE format and work with OpenAI client libraries.
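
With the OpenAI Python client, streaming is a matter of passing stream=True and iterating over the chunks:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()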

See: Streaming Inference

Embeddings

Generate vector representations for search, RAG, and similarity features. Embeddings run on-device on the same model-serving infrastructure.
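
A sketch against the embeddings endpoint; the model name below is a placeholder for whichever embedding model you serve:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

result = client.embeddings.create(
    model="your-embedding-model",  # placeholder, not a real Octomil model name
    input=["how do I reset my password?", "password reset instructions"],
)
vectors = [item.embedding for item in result.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))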

See: Embeddings

Cloud inference

Route requests to cloud models when a query exceeds the capability of the on-device model. Fallback is transparent and requires no code changes; the SDK handles routing automatically.
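
The SDK does this for you; purely to illustrate the routing idea, here is a hand-rolled sketch that prefers the local server and falls back to a cloud endpoint when the query needs a larger model or the local call fails. The cloud URL, model names, and capability check are all assumptions:

from openai import OpenAI

local = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
cloud = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")  # hypothetical cloud endpoint

def complete(prompt: str, needs_large_model: bool = False) -> str:
    if not needs_large_model:
        try:
            r = local.chat.completions.create(
                model="phi-4-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return r.choices[0].message.content
        except Exception:
            pass  # fall through to the cloud on any local failure
    r = cloud.chat.completions.create(
        model="cloud-model",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content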

See: Cloud Inference

Quality evaluation

Run evaluation suites against deployed models to measure accuracy and latency and to catch regressions before rolling out to your fleet.
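
As a rough illustration of what a suite measures (not Octomil's evaluation harness), here is a tiny hand-rolled loop that checks exact-match accuracy and per-request latency against the local server:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Toy evaluation cases: (prompt, expected answer). Real suites are far larger.
cases = [
    ("What is 2 + 2? Answer with just the number.", "4"),
    ("What is the capital of France? Answer with just the city.", "Paris"),
]

correct, latencies = 0, []
for prompt, expected in cases:
    start = time.perf_counter()
    r = client.chat.completions.create(
        model="phi-4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    latencies.append(time.perf_counter() - start)
    if r.choices[0].message.content.strip() == expected:
        correct += 1

print(f"accuracy: {correct}/{len(cases)}, mean latency: {sum(latencies)/len(latencies):.2f}s")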

See: Quality Evaluation