Local Inference

octomil serve phi-4-mini

Starts an OpenAI-compatible server on localhost:8080. Auto-selects the fastest engine for your hardware.

Send a request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "Explain federated learning in one sentence."}]
}'

Models

Any Hugging Face model, GGUF file, or local path works:

octomil serve meta-llama/Llama-3.2-1B-Instruct
octomil serve ./my-local-model.gguf
octomil serve ~/models/my-finetuned-model

Short aliases for popular models:

octomil serve phi-4-mini        # 3.8B — strong reasoning
octomil serve llama-3.2-3b      # 3B — general purpose
octomil serve gemma-4b          # 4B — higher quality
octomil serve mistral-7b        # 7B — instruction-following
octomil serve qwen-2.5-1.5b     # 1.5B — multilingual
octomil serve smollm-360m       # 360M — testing
octomil serve deepseek-r1:7b    # 7B — reasoning
octomil serve codellama-7b      # 7B — code generation
octomil serve whisper-base      # 74M — speech-to-text

Append a variant for quantization control:

octomil serve gemma-4b:4bit     # 4-bit (smallest, fastest)
octomil serve gemma-4b:8bit     # 8-bit (better quality)
octomil serve gemma-4b:fp16     # Full precision
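A quick way to pick a variant is a back-of-envelope weight-size estimate: parameter count times bits per weight. The sketch below is illustrative arithmetic, not Octomil code, and ignores activation memory and runtime overhead.

```python
def approx_model_size_gb(params_billion: float, bits: int) -> float:
    """Rough weight footprint: parameters * bits per weight, in decimal GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A ~4B-parameter model at each quantization variant:
for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: ~{approx_model_size_gb(4, bits):.1f} GB")
```

So the 4-bit variant of a 4B model needs roughly 2 GB for weights alone, the fp16 variant roughly 8 GB.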

List all available models: octomil list models

See Supported Models for the full compatibility matrix.

Engine Support

Octomil includes multiple inference engines and selects the best one for your hardware automatically.

| Engine | Platform | Hardware | Priority | Notes |
|---|---|---|---|---|
| mlx | macOS (Apple Silicon) | CPU + GPU (Metal) | 10 | Fastest on M1/M2/M3/M4 Macs. Uses unified memory. |
| mnn | Cross-platform | Metal, Vulkan, CUDA, OpenCL | 15 | Alibaba's MNN-LLM runtime. Supports GGUF and .mnn models. Up to 25x faster than llama.cpp on some GPUs. |
| mlc-llm | Cross-platform | Metal, Vulkan, CUDA, OpenCL, WebGPU | 18 | TVM-compiled native GPU inference. Supports all GPU backends. Install with pip install octomil[mlc]. |
| llamacpp | Cross-platform | CPU, CUDA, Metal | 20 | Broad compatibility. GGUF quantized models. |
| executorch | Mobile / Edge | CoreML, XNNPACK, Vulkan, QNN | 25 | Meta's on-device runtime for .pte models. Auto-selects the best hardware delegate per platform. |
| onnxruntime | Cross-platform | CPU, CUDA, TensorRT, DirectML, CoreML, OpenVINO | 30 | Most portable engine. Works everywhere; the safe fallback when specialized engines aren't available. Install with pip install octomil[onnx]. |
| whisper | Cross-platform | CPU, Metal, CUDA | | Speech-to-text engine powered by whisper.cpp. Serves /v1/audio/transcriptions (OpenAI-compatible). Install with pip install octomil[whisper]. |
| echo | Any | CPU | last | Fallback engine for testing. Returns fixed responses. |

Auto-Benchmark

By default, octomil serve runs a short benchmark across available engines and selects the fastest:

[benchmark] mlx: 42.3 tok/s
[benchmark] llamacpp: 31.8 tok/s
[engine] Selected: mlx (fastest)
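The selection logic amounts to timing a short generation per engine and keeping the highest tokens-per-second score. Here is a minimal sketch of that idea; the stand-in engines and their speeds are made up for illustration and are not Octomil's actual benchmark code.

```python
import time

def benchmark(generate, n_tokens: int = 32) -> float:
    """Time a short generation and return tokens per second."""
    start = time.perf_counter()
    generate(n_tokens)
    return n_tokens / (time.perf_counter() - start)

def pick_fastest(engines):
    """engines maps name -> generate fn; returns (best name, all scores)."""
    scores = {name: benchmark(fn) for name, fn in engines.items()}
    return max(scores, key=scores.get), scores

# Stand-in engines: a real build would call into mlx / llamacpp here.
engines = {
    "mlx": lambda n: time.sleep(0.001 * n),       # ~1000 tok/s stub
    "llamacpp": lambda n: time.sleep(0.004 * n),  # ~250 tok/s stub
}
best, scores = pick_fastest(engines)
print(f"[engine] Selected: {best} (fastest)")
```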

Engine Override

Force a specific engine with --engine:

octomil serve gemma-1b --engine mlx
octomil serve gemma-1b --engine mnn
octomil serve gemma-1b --engine llamacpp
octomil serve gemma-1b --engine executorch
octomil serve gemma-1b --engine onnxruntime
octomil serve whisper-base --engine whisper
octomil serve gemma-1b --engine echo

API Endpoints

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint. Supports both streaming and non-streaming responses.

Non-streaming request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is federated learning?"}
]
}'

Streaming request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "Explain edge computing."}],
"stream": true
}'
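With "stream": true the server emits OpenAI-style server-sent events, one `data:` line per chunk, terminated by `data: [DONE]`. The sketch below parses sample chunk lines offline so it needs no running server; the chunk payloads are illustrative, though they follow the standard chat-completions delta shape.

```python
import json

def collect_stream(sse_lines):
    """Accumulate content deltas from OpenAI-style SSE chunk lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments and blank keep-alive lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

chunks = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Edge computing moves"}}]}',
    'data: {"choices": [{"delta": {"content": " compute close to data."}}]}',
    'data: [DONE]',
]
print(collect_stream(chunks))
```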

GET /v1/models

Lists available models on the server.

curl http://localhost:8080/v1/models
{
"data": [
{
"id": "gemma-1b",
"object": "model",
"owned_by": "octomil"
}
]
}

GET /v1/engines

Lists available inference engines and their status.

curl http://localhost:8080/v1/engines
{
"engines": [
{"name": "mlx", "available": true, "platform": "darwin-arm64"},
{"name": "mnn", "available": true, "platform": "any"},
{"name": "llamacpp", "available": true, "platform": "any"},
{"name": "executorch", "available": false, "platform": "mobile"},
{"name": "echo", "available": true, "platform": "any"}
],
"active": "mlx"
}

GET /v1/cache/stats

Returns KV prefix cache statistics.

curl http://localhost:8080/v1/cache/stats
{
"entries": 12,
"memory_mb": 84.3,
"hit_rate": 0.73,
"max_size_mb": 512
}

POST /v1/audio/transcriptions

OpenAI-compatible speech-to-text endpoint. Available when serving a Whisper model.

octomil serve whisper-base
curl http://localhost:8080/v1/audio/transcriptions \
-F file=@recording.wav \
-F model=whisper-base \
-F language=en
{
"text": "Hello, this is a test recording.",
"language": "en",
"duration": 3.2
}

Available Whisper models: whisper-tiny (39M), whisper-base (74M), whisper-small (244M), whisper-medium (769M), whisper-large-v3 (1.5B).

GET /health

Health check endpoint.

curl http://localhost:8080/health
{
"status": "ok",
"engine": "mlx",
"model": "gemma-1b",
"uptime_seconds": 3421
}

JSON Mode

Force the model to output valid JSON using the --json flag at startup or the response_format field per request.

At startup:

octomil serve gemma-1b --json

Per request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "List 3 ML frameworks as JSON."}],
"response_format": {"type": "json_object"}
}'
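Even with JSON mode on, client code should still validate the reply before trusting it. A minimal client-side check, assuming the reply text has already been extracted from the response:

```python
import json

def ensure_json_object(reply: str) -> dict:
    """Reject replies that are not a single valid JSON object."""
    obj = json.loads(reply)  # raises ValueError on invalid JSON
    if not isinstance(obj, dict):
        raise ValueError(f"expected a JSON object, got {type(obj).__name__}")
    return obj

# Illustrative reply a JSON-mode request might produce:
reply = '{"frameworks": ["PyTorch", "JAX", "TensorFlow"]}'
data = ensure_json_object(reply)
print(data["frameworks"])
```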

KV Prefix Caching

Octomil caches key-value pairs from previous prompts to accelerate multi-turn conversations. When a new request shares a prefix with a previous one (for example, the same system prompt), the cached KV pairs are reused instead of recomputed.

This is most useful for:

  • Multi-turn chat sessions with a shared system prompt
  • Batch processing with identical prompt prefixes
  • Applications that call the same model repeatedly with similar context
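The core of prefix reuse is finding how many leading token positions a new prompt shares with a cached one; KV pairs for those positions can be served from cache and only the tail is recomputed. A toy sketch with hypothetical token IDs (not Octomil's cache implementation):

```python
def shared_prefix_len(cached, new):
    """Number of leading token positions two sequences have in common."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# A shared system prompt followed by two different user turns (made-up IDs).
system = [101, 7592, 2088, 102]
turn_a = system + [2054, 2003]
turn_b = system + [4339, 1037]
reused = shared_prefix_len(turn_a, turn_b)
print(f"KV entries reused: {reused} of {len(turn_b)}")
```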

Configure cache size at startup:

octomil serve gemma-1b --cache-size 1024

The --cache-size value is in megabytes. Default is 2048 MB.

Request Queue

When multiple clients send requests concurrently, Octomil queues them and processes them in FIFO order. This prevents request failures under load.

octomil serve gemma-1b --max-queue 64

The queue returns proper HTTP status codes:

  • 503 Service Unavailable when the queue is full
  • 504 Gateway Timeout when a request waits too long
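The semantics above can be sketched with a bounded FIFO queue: admission fails with 503 when the queue is full, and a request that sat past its deadline is answered with 504 instead of being run. This is an illustrative model of the documented behavior, not Octomil's server code; the timeout value is a made-up parameter.

```python
import queue
import time

class RequestQueue:
    """Bounded FIFO queue: 503 when full, 504 when a request waited too long."""

    def __init__(self, max_depth: int = 32, wait_timeout: float = 30.0):
        self._q = queue.Queue(maxsize=max_depth)
        self._timeout = wait_timeout

    def submit(self, request) -> int:
        """Client side: 200 if queued, 503 Service Unavailable if full."""
        try:
            self._q.put_nowait((time.monotonic(), request))
        except queue.Full:
            return 503
        return 200

    def take(self):
        """Worker side: (200, request), or (504, None) if the deadline passed."""
        enqueued, request = self._q.get_nowait()
        if time.monotonic() - enqueued > self._timeout:
            return 504, None
        return 200, request

rq = RequestQueue(max_depth=2)
statuses = [rq.submit(f"req-{i}") for i in range(3)]
print(statuses)  # the third request is rejected with 503
```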

Check queue status at runtime:

curl http://localhost:8080/v1/queue/stats
{
"pending": 2,
"max_depth": 32,
"total_processed": 147,
"total_rejected": 0
}

Set --max-queue 0 to disable the queue and process requests directly (original behavior).

Configuration Options

| Option | Default | Description |
|---|---|---|
| --port, -p | 8080 | HTTP server port |
| --host | 0.0.0.0 | Host to bind to |
| --engine, -e | auto | Force a specific engine (mlx, mnn, llamacpp, executorch, onnxruntime, whisper, echo) |
| --json | off | Enable JSON-only output mode |
| --cache-size | 2048 | KV prefix cache size in MB |
| --no-cache | off | Disable the KV cache entirely |
| --max-queue | 32 | Max pending requests in the queue (0 to disable) |
| --models | - | Comma-separated models for multi-model serving |
| --auto-route | off | Enable automatic query routing (requires --models) |
| --route-strategy | complexity | Routing strategy for --auto-route |
| --api-key | none | Octomil API key for telemetry reporting |
| --api-base | https://api.octomil.com | Octomil API base URL |

Example with all options:

octomil serve gemma-1b \
--port 9090 \
--engine mlx \
--json \
--cache-size 1024 \
--api-key <your-api-key>

Telemetry Integration

When you set --api-key, the serve command automatically reports inference metrics to your Octomil dashboard. This includes time to first chunk, throughput, and per-request latency.

octomil serve gemma-1b --api-key <your-api-key>

No additional configuration is needed. Telemetry runs in a background thread and never blocks inference. See the Telemetry and Observability documentation for details on what is reported and how to use the data.
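The usual pattern behind "never blocks inference" is fire-and-forget reporting: the hot path only enqueues a metric, and a daemon thread drains the queue and ships it. A minimal sketch of that pattern, not Octomil's telemetry code; the metric names are illustrative.

```python
import queue
import threading

class TelemetryReporter:
    """Fire-and-forget metric reporting from a background daemon thread."""

    def __init__(self, send):
        self._q = queue.Queue()
        self._send = send  # e.g. an HTTP POST to the telemetry endpoint
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, metric: dict) -> None:
        # Called on the inference path: enqueue and return immediately.
        self._q.put(metric)

    def _drain(self) -> None:
        while True:
            self._send(self._q.get())
            self._q.task_done()

received = []
reporter = TelemetryReporter(received.append)
reporter.record({"ttfc_ms": 118})
reporter.record({"throughput_tok_s": 42.3})
reporter._q.join()  # only for this demo; inference never waits on the queue
```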