Capabilities
Octomil provides a full inference stack that runs on-device. Here is what you can do with it.
Local inference
Run any supported model locally with a single command:
```sh
octomil serve phi-4-mini
```
The server exposes an OpenAI-compatible API at http://localhost:8080/v1 and supports chat completions, embeddings, and streaming.
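For example, a quick sanity check with the official OpenAI Python client; the api_key value is a placeholder, on the assumption that the local server does not validate it:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Octomil server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```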
See: Octomil Serve
Structured decoding
Guarantee valid JSON output by enforcing schemas at the token level. No retries, no post-processing: every response is valid on the first attempt.
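A minimal sketch with the OpenAI Python client, assuming Octomil mirrors the OpenAI json_schema convention for response_format; the schema itself is illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Illustrative schema; the response_format shape below follows the
# OpenAI json_schema convention, which Octomil may mirror (an assumption).
ticket_schema = {
    "name": "ticket",
    "schema": {
        "type": "object",
        "properties": {
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "summary": {"type": "string"},
        },
        "required": ["priority", "summary"],
    },
}

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "File a ticket: checkout page is down."}],
    response_format={"type": "json_schema", "json_schema": ticket_schema},
)
print(response.choices[0].message.content)  # parses as JSON matching the schema
```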
See: Structured Decoding
Speculative decoding
Use a small draft model to accelerate generation from a larger model. The draft model proposes a short run of tokens; the main model verifies them all in a single parallel pass. Throughput improves 2-3x with no quality loss, since verification preserves the main model's output distribution.
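To make the mechanism concrete, here is a toy sketch of the propose-and-verify loop using stand-in "models" (plain callables over a four-token vocabulary); Octomil's real implementation lives inside the server:

```python
import random

VOCAB = ["a", "b", "c", "d"]

def toy_model(bias):
    # Stand-in for an LLM: returns a next-token distribution given a context.
    def probs(context):
        weights = [(i + bias + len(context)) % 5 + 1 for i in range(len(VOCAB))]
        total = sum(weights)
        return {tok: w / total for tok, w in zip(VOCAB, weights)}
    return probs

draft_model = toy_model(1)   # small, fast draft
target_model = toy_model(3)  # large, slow target

def sample(probs):
    tokens = list(probs)
    return random.choices(tokens, weights=[probs[t] for t in tokens])[0]

def speculative_step(context, k=4):
    # 1. Draft model samples k tokens ahead (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = sample(draft_model(ctx))
        proposed.append(tok)
        ctx.append(tok)
    # 2. Target model verifies each proposal, accepting with probability
    #    min(1, p_target / p_draft). This acceptance rule is what keeps
    #    the output distribution identical to the target model's.
    accepted, ctx = [], list(context)
    for tok in proposed:
        p_t = target_model(ctx)[tok]
        p_d = draft_model(ctx)[tok]
        if random.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
            ctx.append(tok)
        else:
            break  # the full algorithm resamples a corrected token here; omitted
    return accepted

print(speculative_step(["a"], k=4))
```

In a real deployment the target model scores all k proposals in one forward pass, which is where the speedup comes from.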
See: Speculative Decoding
Streaming
Receive tokens as they are generated, for real-time UIs. Responses stream in the standard SSE format and work with OpenAI client libraries.
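With the OpenAI Python client, streaming is just stream=True; a sketch against the local server described above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# stream=True makes the client consume the server's SSE events as chunks.
stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry role/metadata rather than text
        print(delta, end="", flush=True)
print()
```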
See: Streaming Inference
Embeddings
Generate vector representations for search, RAG, and similarity features. Runs on-device with the same model serving infrastructure.
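A sketch of the embeddings endpoint via the OpenAI Python client; the model name is illustrative, so substitute whichever embedding model you serve:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# "nomic-embed-text" is an illustrative model name, not a guaranteed default.
result = client.embeddings.create(
    model="nomic-embed-text",
    input=["how do I reset my password?", "password reset instructions"],
)
vectors = [item.embedding for item in result.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```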
See: Embeddings
Cloud inference
Route requests to cloud models when a query exceeds on-device model capability. Transparent fallback with no code changes: the SDK handles routing automatically.
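The sketch below illustrates the behavior conceptually, with the fallback written out by hand; in Octomil the SDK performs this routing for you, and the cloud model name here is illustrative:

```python
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
cloud = OpenAI()  # reads cloud credentials from the environment

def complete(messages):
    # Conceptual stand-in for the SDK's automatic routing: try on-device
    # first, fall back to a cloud model if the local request fails.
    try:
        return local.chat.completions.create(model="phi-4-mini", messages=messages)
    except Exception:
        return cloud.chat.completions.create(model="gpt-4o-mini", messages=messages)

reply = complete([{"role": "user", "content": "Draft a 2,000-word market analysis."}])
print(reply.choices[0].message.content)
```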
See: Cloud Inference
Quality evaluation
Run evaluation suites against deployed models to measure accuracy and latency and catch regressions before rolling out to your fleet.
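A hand-rolled miniature of such a suite, run against the local server; the dataset and pass criterion are made up for illustration:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Tiny illustrative eval set; real suites would be far larger.
CASES = [
    {"prompt": "What is 2 + 2? Answer with the number only.", "expected": "4"},
    {"prompt": "Capital of France? One word.", "expected": "Paris"},
]

correct, latencies = 0, []
for case in CASES:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="phi-4-mini",
        messages=[{"role": "user", "content": case["prompt"]}],
    )
    latencies.append(time.perf_counter() - start)
    if case["expected"].lower() in resp.choices[0].message.content.lower():
        correct += 1

print(f"accuracy: {correct / len(CASES):.0%}")
print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.0f} ms")
```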
See: Quality Evaluation