Local Inference
octomil serve gemma-1b
Starts an OpenAI-compatible server on localhost:8080. Auto-selects the fastest engine for your hardware.
Send a request:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-1b",
    "messages": [{"role": "user", "content": "Explain federated learning in one sentence."}]
  }'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-1b",
    messages=[{"role": "user", "content": "Explain federated learning in one sentence."}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "gemma-1b",
  messages: [{ role: "user", content: "Explain federated learning in one sentence." }],
});
console.log(response.choices[0].message.content);
Models
Any Hugging Face model, GGUF file, or local path works:
octomil serve meta-llama/Llama-3.2-1B-Instruct
octomil serve ./my-local-model.gguf
octomil serve ~/models/my-finetuned-model
Short aliases for popular models:
octomil serve phi-4-mini # 3.8B — strong reasoning
octomil serve llama-3.2-3b # 3B — general purpose
octomil serve gemma-4b # 4B — higher quality
octomil serve mistral-7b # 7B — instruction-following
octomil serve qwen-2.5-1.5b # 1.5B — multilingual
octomil serve smollm-360m # 360M — testing
octomil serve deepseek-r1:7b # 7B — reasoning
octomil serve codellama-7b # 7B — code generation
octomil serve whisper-base # 74M — speech-to-text
Append a variant for quantization control:
octomil serve gemma-4b:4bit # 4-bit (smallest, fastest)
octomil serve gemma-4b:8bit # 8-bit (better quality)
octomil serve gemma-4b:fp16 # Full precision
List all available models: octomil list models
See Supported Models for the full compatibility matrix.
Engine Support
Octomil includes multiple inference engines and selects the best one for your hardware automatically.
| Engine | Platform | Hardware | Priority | Notes |
|---|---|---|---|---|
| mlx | macOS (Apple Silicon) | CPU + GPU (Metal) | 10 | Fastest on M1/M2/M3/M4 Macs. Uses unified memory. |
| mnn | Cross-platform | Metal, Vulkan, CUDA, OpenCL | 15 | Alibaba's MNN-LLM runtime. Supports GGUF and .mnn models. Up to 25x faster than llama.cpp on some GPUs. |
| mlc-llm | Cross-platform | Metal, Vulkan, CUDA, OpenCL, WebGPU | 18 | TVM-compiled native GPU inference. Supports all GPU backends. Install with pip install octomil[mlc]. |
| llamacpp | Cross-platform | CPU, CUDA, Metal | 20 | Broad compatibility. GGUF quantized models. |
| executorch | Mobile / Edge | CoreML, XNNPACK, Vulkan, QNN | 25 | Meta's on-device runtime for .pte models. Auto-selects best hardware delegate per platform. |
| onnxruntime | Cross-platform | CPU, CUDA, TensorRT, DirectML, CoreML, OpenVINO | 30 | Most portable engine. Works everywhere — the safe fallback when specialized engines aren't available. Install with pip install octomil[onnx]. |
| whisper | Cross-platform | CPU, Metal, CUDA | — | Speech-to-text engine powered by whisper.cpp. Serves /v1/audio/transcriptions (OpenAI-compatible). Install with pip install octomil[whisper]. |
| echo | Any | CPU | last | Fallback engine for testing. Returns fixed responses. |
Auto-Benchmark
By default, octomil serve runs a short benchmark across available engines and selects the fastest:
[benchmark] mlx: 42.3 tok/s
[benchmark] llamacpp: 31.8 tok/s
[engine] Selected: mlx (fastest)
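Conceptually, the selection step just keeps the engine with the highest measured throughput. A minimal sketch of that logic (the pick_fastest helper and the numbers are illustrative, not Octomil's internals):

```python
def pick_fastest(results: dict) -> str:
    """Return the engine name with the highest tokens/sec."""
    return max(results, key=results.get)

# Throughput numbers matching the log output above.
results = {"mlx": 42.3, "llamacpp": 31.8}
print(f"[engine] Selected: {pick_fastest(results)} (fastest)")
```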
Engine Override
Force a specific engine with --engine:
octomil serve gemma-1b --engine mlx
octomil serve gemma-1b --engine mnn
octomil serve gemma-1b --engine llamacpp
octomil serve gemma-1b --engine executorch
octomil serve gemma-1b --engine onnxruntime
octomil serve whisper-base --engine whisper
octomil serve gemma-1b --engine echo
API Endpoints
POST /v1/chat/completions
OpenAI-compatible chat completions endpoint. Supports both streaming and non-streaming responses.
Non-streaming request:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-1b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is federated learning?"}
    ]
  }'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-1b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is federated learning?"},
    ],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "gemma-1b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is federated learning?" },
  ],
});
console.log(response.choices[0].message.content);
Streaming request:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-1b",
    "messages": [{"role": "user", "content": "Explain edge computing."}],
    "stream": true
  }'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="gemma-1b",
    messages=[{"role": "user", "content": "Explain edge computing."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const stream = await client.chat.completions.create({
  model: "gemma-1b",
  messages: [{ role: "user", content: "Explain edge computing." }],
  stream: true,
});
for await (const chunk of stream) {
  if (chunk.choices[0].delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
GET /v1/models
Lists available models on the server.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/models
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
models = client.models.list()
for model in models.data:
    print(model.id)
const response = await fetch("http://localhost:8080/v1/models");
const data = await response.json();
console.log(data);
Example response:
{
  "data": [
    {
      "id": "gemma-1b",
      "object": "model",
      "owned_by": "octomil"
    }
  ]
}
GET /v1/engines
Lists available inference engines and their status.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/engines
import requests
response = requests.get("http://localhost:8080/v1/engines")
print(response.json())
const response = await fetch("http://localhost:8080/v1/engines");
const data = await response.json();
console.log(data);
Example response:
{
  "engines": [
    {"name": "mlx", "available": true, "platform": "darwin-arm64"},
    {"name": "mnn", "available": true, "platform": "any"},
    {"name": "llamacpp", "available": true, "platform": "any"},
    {"name": "executorch", "available": false, "platform": "mobile"},
    {"name": "echo", "available": true, "platform": "any"}
  ],
  "active": "mlx"
}
GET /v1/cache/stats
Returns KV prefix cache statistics.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/cache/stats
import requests
response = requests.get("http://localhost:8080/v1/cache/stats")
print(response.json())
const response = await fetch("http://localhost:8080/v1/cache/stats");
const data = await response.json();
console.log(data);
Example response:
{
  "entries": 12,
  "memory_mb": 84.3,
  "hit_rate": 0.73,
  "max_size_mb": 512
}
POST /v1/audio/transcriptions
OpenAI-compatible speech-to-text endpoint. Available when serving a Whisper model.
octomil serve whisper-base
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@recording.wav \
  -F model=whisper-base \
  -F language=en
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
with open("recording.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-base",
        file=f,
        language="en",
    )
print(transcript.text)
import fs from "fs";
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const transcript = await client.audio.transcriptions.create({
  model: "whisper-base",
  file: fs.createReadStream("recording.wav"),
  language: "en",
});
console.log(transcript.text);
Example response:
{
  "text": "Hello, this is a test recording.",
  "language": "en",
  "duration": 3.2
}
Available Whisper models: whisper-tiny (39M), whisper-base (74M), whisper-small (244M), whisper-medium (769M), whisper-large-v3 (1.5B).
GET /health
Health check endpoint.
- cURL
- Python
- JavaScript
curl http://localhost:8080/health
import requests
response = requests.get("http://localhost:8080/health")
print(response.json())
const response = await fetch("http://localhost:8080/health");
const data = await response.json();
console.log(data);
Example response:
{
  "status": "ok",
  "engine": "mlx",
  "model": "gemma-1b",
  "uptime_seconds": 3421
}
JSON Mode
Force the model to output valid JSON using the --json flag at startup or the response_format field per request.
At startup:
octomil serve gemma-1b --json
Per request:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-1b",
    "messages": [{"role": "user", "content": "List 3 ML frameworks as JSON."}],
    "response_format": {"type": "json_object"}
  }'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-1b",
    messages=[{"role": "user", "content": "List 3 ML frameworks as JSON."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "gemma-1b",
  messages: [{ role: "user", content: "List 3 ML frameworks as JSON." }],
  response_format: { type: "json_object" },
});
console.log(response.choices[0].message.content);
KV Prefix Caching
Octomil caches key-value pairs from previous prompts to accelerate multi-turn conversations. When a new request shares a prefix with a previous one (for example, the same system prompt), the cached KV pairs are reused instead of recomputed.
This is most useful for:
- Multi-turn chat sessions with a shared system prompt
- Batch processing with identical prompt prefixes
- Applications that call the same model repeatedly with similar context
Configure cache size at startup:
octomil serve gemma-1b --cache-size 1024
The --cache-size value is in megabytes. Default is 2048 MB.
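To see why shared prefixes help, here is a minimal sketch of the mechanism (illustrative only, not Octomil's implementation): cached KV state is keyed by token prefix, and a new request reuses the longest prefix already in the cache, so only the remaining tokens need a fresh forward pass.

```python
def longest_cached_prefix(cache: dict, tokens: list) -> int:
    """Return the length of the longest prompt prefix already in the cache."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

cache = {}
system_prompt = ["<sys>", "You", "are", "helpful"]

# Turn 1: the full prompt is computed, then its KV state is cached.
cache[tuple(system_prompt)] = "kv-state"

# Turn 2 shares the system prompt, so only the new tokens are recomputed.
turn2 = system_prompt + ["What", "is", "FL?"]
reused = longest_cached_prefix(cache, turn2)
print(f"reused {reused} of {len(turn2)} tokens")
```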
Request Queue
When multiple clients send requests concurrently, Octomil queues them and processes them in FIFO order. This prevents request failures under load.
octomil serve gemma-1b --max-queue 64
The queue returns proper HTTP status codes:
- 503 Service Unavailable when the queue is full
- 504 Gateway Timeout when a request waits too long
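On the client side, both status codes are a signal to back off and retry. A minimal sketch of such a wrapper (the with_retries name and the stub transport are hypothetical, not part of Octomil):

```python
import time

def with_retries(send, max_attempts=4, base_delay=0.5):
    """Call send(); retry with exponential backoff on 503/504 statuses."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in (503, 504):
            return status, body
        time.sleep(base_delay * (2 ** attempt))
    return status, body  # give up, return the last response

# Stub transport: queue full twice, then success.
responses = iter([(503, "queue full"), (503, "queue full"), (200, "ok")])
print(with_retries(lambda: next(responses), base_delay=0.01))
```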
Check queue status at runtime:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/queue/stats
import requests
response = requests.get("http://localhost:8080/v1/queue/stats")
print(response.json())
const response = await fetch("http://localhost:8080/v1/queue/stats");
const data = await response.json();
console.log(data);
Example response:
{
  "pending": 2,
  "max_depth": 32,
  "total_processed": 147,
  "total_rejected": 0
}
Set --max-queue 0 to disable the queue and process requests directly (original behavior).
Configuration Options
| Option | Default | Description |
|---|---|---|
| --port, -p | 8080 | HTTP server port |
| --host | 0.0.0.0 | Host to bind to |
| --engine, -e | auto | Force a specific engine (mlx, mnn, llamacpp, executorch, onnxruntime, whisper, echo) |
| --json | off | Enable JSON-only output mode |
| --cache-size | 2048 | KV prefix cache size in MB |
| --no-cache | off | Disable KV cache entirely |
| --max-queue | 32 | Max pending requests in queue (0 to disable) |
| --models | - | Comma-separated models for multi-model serving |
| --auto-route | off | Enable automatic query routing (requires --models) |
| --route-strategy | complexity | Routing strategy for --auto-route |
| --api-key | none | Octomil API key for telemetry reporting |
| --api-base | https://api.octomil.com | Octomil API base URL |
Example with all options:
octomil serve gemma-1b \
  --port 9090 \
  --engine mlx \
  --json \
  --cache-size 1024 \
  --api-key <your-api-key>
Telemetry Integration
When you set --api-key, the serve command automatically reports inference metrics to your Octomil dashboard. This includes time to first chunk, throughput, and per-request latency.
octomil serve gemma-1b --api-key <your-api-key>
No additional configuration is needed. Telemetry runs in a background thread and never blocks inference. See the Telemetry and Observability documentation for details on what is reported and how to use the data.
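The fire-and-forget pattern described above can be sketched as a queue drained by a daemon thread (illustrative only; the sent list stands in for the actual HTTP reporting):

```python
import queue
import threading

metrics_q = queue.Queue()
sent = []

def reporter():
    """Drain the metrics queue off the request path; None is the shutdown signal."""
    while True:
        m = metrics_q.get()
        if m is None:
            break
        sent.append(m)  # stand-in for the HTTP POST to the telemetry API

t = threading.Thread(target=reporter, daemon=True)
t.start()

# The inference path only enqueues; it never waits on the network.
metrics_q.put({"ttfc_ms": 38, "tok_per_s": 42.3})
metrics_q.put(None)
t.join()
print(sent)
```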
Related Docs
- Telemetry and Observability -- inference telemetry details
- Move-to-Device Recommendations -- automated deployment recommendations based on telemetry
- Monitoring Dashboard -- view inference metrics in the dashboard
- Model Catalog -- model versioning and lifecycle management