Local Inference

octomil serve phi-4-mini

Starts an OpenAI-compatible server on localhost:8080. Auto-selects the fastest engine for your hardware.

Send a request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "Explain federated learning in one sentence."}]
}'

Models

Any Hugging Face model, GGUF file, or local path works:

octomil serve meta-llama/Llama-3.2-1B-Instruct
octomil serve ./my-local-model.gguf
octomil serve ~/models/my-finetuned-model

Short aliases for popular models:

octomil serve phi-4-mini        # 3.8B — strong reasoning
octomil serve llama-3.2-3b      # 3B — general purpose
octomil serve gemma-4b          # 4B — higher quality
octomil serve mistral-7b        # 7B — instruction-following
octomil serve qwen-2.5-1.5b     # 1.5B — multilingual
octomil serve smollm-360m       # 360M — testing
octomil serve deepseek-r1:7b    # 7B — reasoning
octomil serve codellama-7b      # 7B — code generation
octomil serve whisper-base      # 74M — speech-to-text

Append a variant for quantization control:

octomil serve gemma-4b:4bit     # 4-bit (smallest, fastest)
octomil serve gemma-4b:8bit     # 8-bit (better quality)
octomil serve gemma-4b:fp16     # Full precision
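A quick way to pick a variant is a back-of-envelope weight-size estimate: parameter count times bits per weight. The sketch below is illustrative arithmetic, not Octomil code, and ignores activation memory and runtime overhead.

```python
def approx_model_size_gb(params_billion: float, bits: int) -> float:
    """Rough weight footprint: parameters * bits per weight, in decimal GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A ~4B-parameter model at each quantization variant:
for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: ~{approx_model_size_gb(4, bits):.1f} GB")
```

So the 4-bit variant of a 4B model needs roughly 2 GB for weights alone, the fp16 variant roughly 8 GB.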

List all available models: octomil list models

See Supported Models for the full compatibility matrix.

Engine Support

Octomil includes multiple inference engines and selects the best one for your hardware automatically.

| Engine | Platform | Hardware | Priority | Notes |
|---|---|---|---|---|
| mlx | macOS (Apple Silicon) | CPU + GPU (Metal) | 10 | Fastest on M1/M2/M3/M4 Macs. Uses unified memory. |
| mnn | Cross-platform | Metal, Vulkan, CUDA, OpenCL | 15 | Alibaba's MNN-LLM runtime. Supports GGUF and .mnn models. Up to 25x faster than llama.cpp on some GPUs. |
| mlc-llm | Cross-platform | Metal, Vulkan, CUDA, OpenCL, WebGPU | 18 | TVM-compiled native GPU inference. Supports all GPU backends. Install with pip install octomil[mlc]. |
| llamacpp | Cross-platform | CPU, CUDA, Metal | 20 | Broad compatibility. GGUF quantized models. |
| executorch | Mobile / Edge | CoreML, XNNPACK, Vulkan, QNN | 25 | Meta's on-device runtime for .pte models. Auto-selects the best hardware delegate per platform. |
| onnxruntime | Cross-platform | CPU, CUDA, TensorRT, DirectML, CoreML, OpenVINO | 30 | Most portable engine. Works everywhere; the safe fallback when specialized engines aren't available. Install with pip install octomil[onnx]. |
| whisper | Cross-platform | CPU, Metal, CUDA | | Speech-to-text engine powered by whisper.cpp. Serves /v1/audio/transcriptions (OpenAI-compatible). Install with pip install octomil[whisper]. |
| echo | Any | CPU | last | Fallback engine for testing. Returns fixed responses. |

Auto-Benchmark

By default, octomil serve runs a short benchmark across available engines and selects the fastest:

[benchmark] mlx: 42.3 tok/s
[benchmark] llamacpp: 31.8 tok/s
[engine] Selected: mlx (fastest)
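The selection logic amounts to timing a short generation per engine and keeping the highest tokens-per-second score. Here is a minimal sketch of that idea; the stand-in engines and their speeds are made up for illustration and are not Octomil's actual benchmark code.

```python
import time

def benchmark(generate, n_tokens: int = 32) -> float:
    """Time a short generation and return tokens per second."""
    start = time.perf_counter()
    generate(n_tokens)
    return n_tokens / (time.perf_counter() - start)

def pick_fastest(engines):
    """engines maps name -> generate fn; returns (best name, all scores)."""
    scores = {name: benchmark(fn) for name, fn in engines.items()}
    return max(scores, key=scores.get), scores

# Stand-in engines: a real build would call into mlx / llamacpp here.
engines = {
    "mlx": lambda n: time.sleep(0.001 * n),       # ~1000 tok/s stub
    "llamacpp": lambda n: time.sleep(0.004 * n),  # ~250 tok/s stub
}
best, scores = pick_fastest(engines)
print(f"[engine] Selected: {best} (fastest)")
```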

Engine Override

Force a specific engine with --engine:

octomil serve gemma-1b --engine mlx
octomil serve gemma-1b --engine mnn
octomil serve gemma-1b --engine llamacpp
octomil serve gemma-1b --engine executorch
octomil serve gemma-1b --engine onnxruntime
octomil serve whisper-base --engine whisper
octomil serve gemma-1b --engine echo

API Endpoints

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint. Supports both streaming and non-streaming responses.

Non-streaming request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is federated learning?"}
]
}'

Streaming request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "Explain edge computing."}],
"stream": true
}'
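With "stream": true the server emits OpenAI-style server-sent events, one `data:` line per chunk, terminated by `data: [DONE]`. The sketch below parses sample chunk lines offline so it needs no running server; the chunk payloads are illustrative, though they follow the standard chat-completions delta shape.

```python
import json

def collect_stream(sse_lines):
    """Accumulate content deltas from OpenAI-style SSE chunk lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments and blank keep-alive lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

chunks = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Edge computing moves"}}]}',
    'data: {"choices": [{"delta": {"content": " compute close to data."}}]}',
    'data: [DONE]',
]
print(collect_stream(chunks))
```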

GET /v1/models

Lists available models on the server.

curl http://localhost:8080/v1/models
{
"data": [
{
"id": "gemma-1b",
"object": "model",
"owned_by": "octomil"
}
]
}

GET /v1/engines

Lists available inference engines and their status.

curl http://localhost:8080/v1/engines
{
"engines": [
{"name": "mlx", "available": true, "platform": "darwin-arm64"},
{"name": "mnn", "available": true, "platform": "any"},
{"name": "llamacpp", "available": true, "platform": "any"},
{"name": "executorch", "available": false, "platform": "mobile"},
{"name": "echo", "available": true, "platform": "any"}
],
"active": "mlx"
}

GET /v1/cache/stats

Returns KV prefix cache statistics.

curl http://localhost:8080/v1/cache/stats
{
"entries": 12,
"memory_mb": 84.3,
"hit_rate": 0.73,
"max_size_mb": 512
}

POST /v1/audio/transcriptions

OpenAI-compatible speech-to-text endpoint. Available when serving a Whisper model.

octomil serve whisper-base
curl http://localhost:8080/v1/audio/transcriptions \
-F file=@recording.wav \
-F model=whisper-base \
-F language=en
{
"text": "Hello, this is a test recording.",
"language": "en",
"duration": 3.2
}

Available Whisper models: whisper-tiny (39M), whisper-base (74M), whisper-small (244M), whisper-medium (769M), whisper-large-v3 (1.5B).

GET /health

Health check endpoint.

curl http://localhost:8080/health
{
"status": "ok",
"engine": "mlx",
"model": "gemma-1b",
"uptime_seconds": 3421
}

JSON Mode

Force the model to output valid JSON using the --json flag at startup or the response_format field per request.

At startup:

octomil serve gemma-1b --json

Per request:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "List 3 ML frameworks as JSON."}],
"response_format": {"type": "json_object"}
}'
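Even with JSON mode on, client code should still validate the reply before trusting it. A minimal client-side check, assuming the reply text has already been extracted from the response:

```python
import json

def ensure_json_object(reply: str) -> dict:
    """Reject replies that are not a single valid JSON object."""
    obj = json.loads(reply)  # raises ValueError on invalid JSON
    if not isinstance(obj, dict):
        raise ValueError(f"expected a JSON object, got {type(obj).__name__}")
    return obj

# Illustrative reply a JSON-mode request might produce:
reply = '{"frameworks": ["PyTorch", "JAX", "TensorFlow"]}'
data = ensure_json_object(reply)
print(data["frameworks"])
```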

KV Prefix Caching

Octomil caches key-value pairs from previous prompts to accelerate multi-turn conversations. When a new request shares a prefix with a previous one (for example, the same system prompt), the cached KV pairs are reused instead of recomputed.

This is most useful for:

  • Multi-turn chat sessions with a shared system prompt
  • Batch processing with identical prompt prefixes
  • Applications that call the same model repeatedly with similar context
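The core of prefix reuse is finding how many leading token positions a new prompt shares with a cached one; KV pairs for those positions can be served from cache and only the tail is recomputed. A toy sketch with hypothetical token IDs (not Octomil's cache implementation):

```python
def shared_prefix_len(cached, new):
    """Number of leading token positions two sequences have in common."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# A shared system prompt followed by two different user turns (made-up IDs).
system = [101, 7592, 2088, 102]
turn_a = system + [2054, 2003]
turn_b = system + [4339, 1037]
reused = shared_prefix_len(turn_a, turn_b)
print(f"KV entries reused: {reused} of {len(turn_b)}")
```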

Configure cache size at startup:

octomil serve gemma-1b --cache-size 1024

The --cache-size value is in megabytes. Default is 2048 MB.

Request Queue

When multiple clients send requests concurrently, Octomil queues them and processes them in FIFO order. This prevents request failures under load.

octomil serve gemma-1b --max-queue 64

The queue returns proper HTTP status codes:

  • 503 Service Unavailable when the queue is full
  • 504 Gateway Timeout when a request waits too long
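The semantics above can be sketched with a bounded FIFO queue: admission fails with 503 when the queue is full, and a request that sat past its deadline is answered with 504 instead of being run. This is an illustrative model of the documented behavior, not Octomil's server code; the timeout value is a made-up parameter.

```python
import queue
import time

class RequestQueue:
    """Bounded FIFO queue: 503 when full, 504 when a request waited too long."""

    def __init__(self, max_depth: int = 32, wait_timeout: float = 30.0):
        self._q = queue.Queue(maxsize=max_depth)
        self._timeout = wait_timeout

    def submit(self, request) -> int:
        """Client side: 200 if queued, 503 Service Unavailable if full."""
        try:
            self._q.put_nowait((time.monotonic(), request))
        except queue.Full:
            return 503
        return 200

    def take(self):
        """Worker side: (200, request), or (504, None) if the deadline passed."""
        enqueued, request = self._q.get_nowait()
        if time.monotonic() - enqueued > self._timeout:
            return 504, None
        return 200, request

rq = RequestQueue(max_depth=2)
statuses = [rq.submit(f"req-{i}") for i in range(3)]
print(statuses)  # the third request is rejected with 503
```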

Check queue status at runtime:

curl http://localhost:8080/v1/queue/stats
{
"pending": 2,
"max_depth": 32,
"total_processed": 147,
"total_rejected": 0
}

Set --max-queue 0 to disable the queue and process requests directly (original behavior).

Configuration Options

| Option | Default | Description |
|---|---|---|
| --port, -p | 8080 | HTTP server port |
| --host | 0.0.0.0 | Host to bind to |
| --engine, -e | auto | Force a specific engine (mlx, mnn, llamacpp, executorch, onnxruntime, whisper, echo) |
| --json | off | Enable JSON-only output mode |
| --cache-size | 2048 | KV prefix cache size in MB |
| --no-cache | off | Disable the KV cache entirely |
| --max-queue | 32 | Max pending requests in the queue (0 to disable) |
| --models | - | Comma-separated models for multi-model serving |
| --auto-route | off | Enable automatic query routing (requires --models) |
| --route-strategy | complexity | Routing strategy for --auto-route |
| --api-key | none | Octomil API key for telemetry reporting |
| --api-base | https://api.octomil.com | Octomil API base URL |

Example with all options:

octomil serve gemma-1b \
--port 9090 \
--engine mlx \
--json \
--cache-size 1024 \
--api-key <your-api-key>

Telemetry Integration

When you set --api-key, the serve command automatically reports inference metrics to your Octomil dashboard. This includes time to first chunk, throughput, and per-request latency.

octomil serve gemma-1b --api-key <your-api-key>

No additional configuration is needed. Telemetry runs in a background thread and never blocks inference. See the Telemetry and Observability documentation for details on what is reported and how to use the data.
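The usual pattern behind "never blocks inference" is fire-and-forget reporting: the hot path only enqueues a metric, and a daemon thread drains the queue and ships it. A minimal sketch of that pattern, not Octomil's telemetry code; the metric names are illustrative.

```python
import queue
import threading

class TelemetryReporter:
    """Fire-and-forget metric reporting from a background daemon thread."""

    def __init__(self, send):
        self._q = queue.Queue()
        self._send = send  # e.g. an HTTP POST to the telemetry endpoint
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, metric: dict) -> None:
        # Called on the inference path: enqueue and return immediately.
        self._q.put(metric)

    def _drain(self) -> None:
        while True:
            self._send(self._q.get())
            self._q.task_done()

received = []
reporter = TelemetryReporter(received.append)
reporter.record({"ttfc_ms": 118})
reporter.record({"throughput_tok_s": 42.3})
reporter._q.join()  # only for this demo; inference never waits on the queue
```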