CLI Reference
Installation
- Binary: curl -fsSL https://octomil.com/install.sh | sh (downloads a standalone binary; no Python required)
- Homebrew: brew install octomil/tap/octomil
- Windows (PowerShell): irm https://octomil.com/install.ps1 | iex
Verify:
octomil --version
Commands
octomil serve
Start a local OpenAI-compatible inference server.
octomil serve <model> [options]
| Option | Default | Description |
|---|---|---|
| --port, -p | 8080 | Port to listen on |
| --host | 0.0.0.0 | Host to bind to |
| --engine, -e | auto | Force engine (mlx-lm, llama.cpp, mnn, mlc-llm, onnxruntime) |
| --benchmark | off | Run latency benchmark on startup |
| --share | off | Share anonymous benchmark data with Octomil Cloud |
| --json-mode | off | Default all responses to JSON output |
| --cache-size | 2048 | KV cache size in MB |
| --no-cache | off | Disable KV cache |
| --max-queue | 32 | Max pending requests in queue (0 to disable) |
| --models | - | Comma-separated models for multi-model serving |
| --auto-route | off | Enable automatic query routing (requires --models) |
| --route-strategy | complexity | Routing strategy for --auto-route |
# Basic usage
octomil serve phi-mini
# Model with specific quantization (model:variant syntax)
octomil serve gemma-3b:4bit
octomil serve gemma-3b:8bit
# Specific engine + port
octomil serve gemma-1b --engine llama.cpp --port 8080
# Multi-model with routing
octomil serve smollm-360m --models smollm-360m,phi-mini,llama-3b --auto-route
# JSON mode
octomil serve gemma-1b --json-mode
# Speech-to-text (Whisper)
octomil serve whisper-base
octomil serve whisper-large-v3
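The complexity routing strategy used by --auto-route is not specified here; the sketch below is a hypothetical illustration of the idea only (not octomil's implementation): short, simple prompts go to the smallest model in --models, longer or reasoning-heavy prompts to a larger one.

```python
# Hypothetical sketch of a "complexity" routing strategy.
# NOT octomil's actual heuristic -- purely illustrative.
def route_by_complexity(prompt: str, models: list[str]) -> str:
    """Pick a model tier based on a crude prompt-complexity score."""
    score = len(prompt.split())          # longer prompts score higher
    if any(kw in prompt.lower() for kw in ("code", "prove", "analyze")):
        score += 50                      # reasoning-ish keywords bump the score
    if score < 20:
        return models[0]                 # smallest model, e.g. smollm-360m
    if score < 60:
        return models[len(models) // 2]  # mid-tier
    return models[-1]                    # largest model

models = ["smollm-360m", "phi-mini", "llama-3b"]
print(route_by_complexity("Hello", models))  # -> smollm-360m
```

Whatever the real heuristic is, the shape is the same: score the query, map the score to a tier of the models you passed in.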
API usage
Once the server is running, use any OpenAI-compatible client:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-mini",
"messages": [{"role": "user", "content": "Hello"}]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="phi-mini",
messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8080/v1",
apiKey: "not-needed",
});
const response = await client.chat.completions.create({
model: "phi-mini",
messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
octomil benchmark
Run inference benchmarks on a model.
octomil benchmark <model> [options]
| Option | Default | Description |
|---|---|---|
| --local | off | Keep results local (don't upload) |
| --iterations, -n | 10 | Number of inference iterations |
| --max-tokens | 50 | Max tokens per iteration |
| --engine, -e | auto | Force a specific engine |
| --all-engines | off | Benchmark all available engines |
octomil benchmark gemma-1b --all-engines
octomil benchmark phi-mini --iterations 20 --local
By default, each benchmark run contributes anonymous performance data to the Octomil community leaderboard, which improves engine rankings and routing decisions for all users.
Never shared: prompts, model outputs, model files/weights, IP address, device IDs, or user profile data.
Shared (anonymous hardware telemetry): model name, backend/runtime, platform + architecture, OS version, accelerator type, total RAM, iteration count, latency stats (avg/min/max/p50/p90/p95/p99), TTFT, TPOT, throughput, and peak memory.
Opt out anytime: pass --local to keep results on your machine, or disable telemetry globally in your dashboard settings. No data is uploaded without an active API key (octomil login or OCTOMIL_API_KEY).
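The latency statistics in the shared telemetry (avg/min/max, p50/p90/p95/p99, throughput) are standard aggregates over per-iteration latencies. As a rough illustration of how such numbers are derived (not octomil's actual code):

```python
# Illustrative computation of benchmark aggregates from raw per-iteration
# latencies. Standard percentile math, not octomil's implementation.
def latency_stats(latencies_ms: list[float], tokens_per_iter: int) -> dict:
    xs = sorted(latencies_ms)

    def pct(p: float) -> float:
        # nearest-rank percentile over the sorted sample
        k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
        return xs[k]

    avg = sum(xs) / len(xs)
    return {
        "avg": avg, "min": xs[0], "max": xs[-1],
        "p50": pct(50), "p90": pct(90), "p95": pct(95), "p99": pct(99),
        # throughput: tokens generated per second at the average latency
        "throughput_tok_s": tokens_per_iter / (avg / 1000),
    }

stats = latency_stats([100, 120, 110, 300, 115], tokens_per_iter=50)
```

Note how a single slow iteration (300 ms here) dominates the tail percentiles while barely moving p50, which is why the telemetry reports both.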
octomil deploy
Deploy a model to devices.
octomil deploy <name> [options]
| Option | Default | Description |
|---|---|---|
| --version, -v | latest | Version to deploy |
| --phone | off | Deploy to connected phone |
| --rollout, -r | 100 | Rollout percentage (1-100) |
| --strategy, -s | canary | Strategy: canary, immediate, blue_green |
| --target, -t | - | Target formats: ios, android |
| --devices | - | Comma-separated device IDs |
| --group, -g | - | Device group name |
| --dry-run | off | Preview without deploying |
# Deploy to phone (app-install QR -> pairing QR -> multi-device select)
octomil deploy phi-4-mini --phone
# Canary rollout to 10%
octomil deploy sentiment-v1 --rollout 10 --strategy canary
# Target specific devices
octomil deploy gemma-1b --devices device_1,device_2
# Dry run
octomil deploy gemma-1b --group production --dry-run
octomil login
Authenticate with Octomil Cloud.
octomil login [options]
| Option | Default | Description |
|---|---|---|
| --api-key | - | Paste API key directly (skip browser) |
# Browser-based (default)
octomil login
# Direct API key (CI/headless)
octomil login --api-key edg_...
Or set the environment variable:
export OCTOMIL_API_KEY=edg_...
octomil push
Upload model artifacts to the registry. Auto-downloads and converts if the model isn't local.
octomil push [path] --model-id <id> --version <version>
| Option | Default | Description |
|---|---|---|
| path (positional) | - | Path to artifacts or model name (phi-4-mini, hf:org/model, ollama:name) |
| --model-id, -m | inferred | Model ID in the registry |
| --version, -v | required | Semantic version (e.g. 1.0.0) |
| --quantize, -q | - | Quantize models before pushing: auto, int8, int4, dynamic, float16 |
| --quality-threshold | - | Reject quantized models if quality drops below this value (0.0-1.0) |
# Push local artifacts
octomil push ./converted --model-id sentiment-v1 --version 1.0.0
# Auto-download, convert, and push (no local files needed)
octomil push phi-4-mini --version 1.0.0
# Explicit source
octomil push hf:microsoft/Phi-4-mini --version 1.0.0
octomil push ollama:phi4-mini --version 1.0.0
# Push with quantization
octomil push ./converted --model-id sentiment-v1 --version 2.0.0 --quantize int8
# Quantize with quality gate
octomil push phi-4-mini --version 1.0.0 --quantize auto --quality-threshold 0.95
octomil pull
Download a model from the registry.
octomil pull <name> [options]
| Option | Default | Description |
|---|---|---|
| --version, -v | latest | Version to download |
| --format | - | Model format (onnx, coreml, tflite) |
| --output, -o | . | Output directory |
octomil pull sentiment-v1 --version 1.0.0 --format coreml
octomil convert
Convert a model to edge formats locally. This is the primary conversion path: all conversion runs on your machine, with no server round-trip required.
octomil convert <model_path> [options]
| Option | Default | Description |
|---|---|---|
| --formats, -f | onnx | Target formats: onnx, coreml, tflite (comma-separated) |
| --output, -o | ./converted | Output directory |
| --input-shape | 1,3,224,224 | Input tensor shape |
| --push | off | Upload converted artifacts to the registry after conversion |
| --validate / --no-validate | --validate | Run validation checks on converted artifacts |
# Convert locally
octomil convert model.pt --formats onnx,coreml,tflite --output converted_models
# Convert and push to registry in one step
octomil convert model.pt --formats onnx,coreml,tflite --push --model-id sentiment-v1 --version 1.0.0
# Skip validation (faster, for CI pipelines)
octomil convert model.pt --formats onnx,coreml --no-validate
octomil eval
Run quality evaluation comparing cloud vs on-device inference. Sends test inputs to the server's eval endpoint and reports whether the model meets a quality threshold.
octomil eval <model_id> --test-data <path> [options]
| Option | Default | Description |
|---|---|---|
| model_id (positional) | required | Model ID to evaluate |
| --test-data, -d | required | Path to JSONL file with test inputs |
| --threshold, -t | 0.95 | Quality threshold (0.0-1.0) |
| --api-base | http://localhost:8000 | Server API base URL (also reads OCTOMIL_API_BASE) |
| --metrics, -m | similarity,exact_match,latency | Comma-separated metrics to compute |
The test data file is JSONL format. Each line is a JSON object with at least an "input" key and optionally an "expected_output" key:
{"input": "Great product, fast shipping", "expected_output": "positive"}
{"input": "Terrible experience, would not buy again", "expected_output": "negative"}
{"input": "It was okay", "expected_output": "neutral"}
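You can build and sanity-check the test file programmatically; a generic JSONL sketch, not tied to any octomil API:

```python
import json

# Write eval cases to tests.jsonl: one JSON object per line,
# each with an "input" key and an optional "expected_output" key.
cases = [
    {"input": "Great product, fast shipping", "expected_output": "positive"},
    {"input": "It was okay", "expected_output": "neutral"},
]
with open("tests.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Validate before running `octomil eval`: every line must parse
# as JSON and carry an "input" key.
with open("tests.jsonl") as f:
    rows = [json.loads(line) for line in f]
assert all("input" in row for row in rows)
```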
# Basic quality eval
octomil eval sentiment-v1 --test-data tests.jsonl
# Custom threshold and metrics
octomil eval sentiment-v1 --test-data tests.jsonl --threshold 0.90 --metrics similarity,latency
# Against a remote server
octomil eval sentiment-v1 --test-data tests.jsonl --api-base https://api.octomil.com
The command exits with code 1 if the quality threshold is not met, making it suitable for CI pipelines. Output includes overall score, per-metric breakdowns, and statistical significance (p-value, effect size) when available.
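The exact metric definitions are not specified here; a plausible reading of exact_match and similarity (illustrative only, using stdlib difflib rather than whatever octomil actually computes) looks like this:

```python
from difflib import SequenceMatcher

def exact_match(pred: str, expected: str) -> float:
    # 1.0 if outputs match after trimming whitespace, else 0.0
    return 1.0 if pred.strip() == expected.strip() else 0.0

def similarity(pred: str, expected: str) -> float:
    # character-level ratio in [0, 1]; the real metric may be
    # embedding-based -- this is purely illustrative
    return SequenceMatcher(None, pred, expected).ratio()

preds = ["positive", "negative", "positive"]
expected = ["positive", "negative", "neutral"]
em = sum(exact_match(p, e) for p, e in zip(preds, expected)) / len(preds)
print(f"exact_match: {em:.2f}")  # 2 of 3 match -> 0.67
```

An aggregate score like this is what gets compared against --threshold to decide the exit code.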
octomil quantize
Quantize a model for edge deployment without pushing to the registry. Supports ONNX, TFLite, CoreML, and GGUF formats.
octomil quantize <model_path> [options]
| Option | Default | Description |
|---|---|---|
| model_path (positional) | required | Path to model file (.onnx, .tflite, .mlpackage, .mlmodel, .gguf) or directory |
| --method, -m | auto | Quantization method: auto, int8, int4, dynamic, float16 |
| --output, -o | <model_dir>/quantized | Output directory for quantized models |
| --quality-threshold | - | Reject quantized models if quality score drops below this value (0.0-1.0) |
# Auto-select best quantization method
octomil quantize model.onnx
# Specific method
octomil quantize model.onnx --method int8
# Custom output directory
octomil quantize model.onnx --method int4 --output ./optimized
# With quality gate — rejects if quality drops too far
octomil quantize model.onnx --method auto --quality-threshold 0.95
# Quantize all models in a directory
octomil quantize ./models/ --method float16 --output ./quantized
Output reports size reduction, compression ratio, and quality scores (when --quality-threshold is set) for each processed file.
octomil check
Check device compatibility for a local model file.
octomil check <model_path> [options]
| Option | Default | Description |
|---|---|---|
| --devices, -d | - | Device profiles (e.g. iphone_15_pro,pixel_8) |
octomil check model.onnx --devices iphone_15_pro,pixel_8
octomil list models
List available models with their variants and supported engines.
octomil list models
Output shows all available models, quantization variants, and which engines support each:
Model Variants Engines
gemma-1b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
gemma-4b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
phi-4-mini 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp
llama-3.2-1b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
llama-3.2-3b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
whisper-tiny fp16 whisper.cpp
whisper-base fp16 whisper.cpp
whisper-small fp16 whisper.cpp
whisper-medium fp16 whisper.cpp
whisper-large-v3 fp16 whisper.cpp
...
octomil scan
Scan the local network for Octomil inference servers and devices.
octomil scan [options]
| Option | Default | Description |
|---|---|---|
| --timeout | 5 | Scan timeout in seconds |
octomil scan
# Found 2 Octomil instances:
# 192.168.1.42:8000 — phi-4-mini on mlx (58 tok/s)
# 192.168.1.100:8000 — gemma-1b on llama.cpp (34 tok/s)
octomil status
Show deployment status for a model.
octomil status <name>
octomil dashboard
Open the Octomil dashboard in your browser.
octomil dashboard
octomil init
Initialize an Octomil organization for enterprise use.
octomil init <org_name> [options]
| Option | Default | Description |
|---|---|---|
| --compliance | - | Compliance preset: hipaa, gdpr, pci, soc2 |
| --region | us | Data region: us, eu, ap |
| --api-base | - | Override API base URL |
octomil init "Acme Corp" --compliance hipaa --region us
octomil org
Show current organization info and settings.
octomil org
octomil demo code-assistant
Interactive code assistant powered by a local LLM.
octomil demo code-assistant [options]
| Option | Default | Description |
|---|---|---|
| --model, -m | auto | Model to serve |
| --url | - | Connect to existing server |
| --port, -p | 8099 | Port for auto-started server |
| --no-auto-start | off | Don't auto-start server |
octomil demo code-assistant
octomil demo code-assistant --model phi-mini
octomil launch
Launch a coding agent powered by a local model. Starts octomil serve in the background (if not already running) and configures the agent to use the local endpoint.
octomil launch <agent> [options]
| Argument | Description |
|---|---|
| claude | Launch Claude Code with local backend |
| codex | Launch OpenAI Codex CLI |
| openclaw | Launch OpenClaw agent |
| aider | Launch Aider coding assistant |
| Option | Default | Description |
|---|---|---|
| --model, -m | qwen3 | Model to serve |
| --port, -p | 8080 | Port for local server |
octomil launch claude
octomil launch aider --model deepseek-coder-v2
octomil launch codex --model codestral
octomil models
List available models from Ollama and the Octomil registry.
octomil models [options]
| Option | Default | Description |
|---|---|---|
| --source | all | Filter source: all, ollama, registry |
octomil models
octomil models --source ollama
octomil rollback
Roll back a model to a previous version.
octomil rollback <name> [options]
| Option | Default | Description |
|---|---|---|
| --to-version | previous | Version to rollback to |
# Rollback to the previous version
octomil rollback sentiment-v1
# Rollback to a specific version
octomil rollback sentiment-v1 --to-version 1.0.0
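Resolving the default "previous" version is just semantic-version ordering, which the registry handles server-side; a hypothetical sketch of that selection:

```python
# Hypothetical illustration of "previous version" selection under
# semantic-version ordering. Not octomil's registry code.
def previous_version(versions: list[str], current: str) -> str:
    """Return the version immediately preceding `current` in semver order."""
    # numeric tuple compare: "1.10.0" correctly sorts after "1.2.0",
    # which plain string comparison would get wrong
    key = lambda v: tuple(int(x) for x in v.split("."))
    ordered = sorted(versions, key=key)
    idx = ordered.index(current)
    if idx == 0:
        raise ValueError("no earlier version to roll back to")
    return ordered[idx - 1]

previous_version(["1.0.0", "1.2.0", "1.10.0"], "1.10.0")  # -> "1.2.0"
```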
octomil pair
Connect to a pairing session as a device. Enter the code displayed by octomil deploy --phone to receive the model deployment.
octomil pair <code> [options]
| Option | Default | Description |
|---|---|---|
| --device-id | auto | Device identifier |
| --platform, -p | auto | Device platform: ios, android, python |
| --device-name | - | Friendly device name |
octomil pair ABC123
octomil pair ABC123 --device-name "Test iPhone"
octomil team
Manage organization team members.
octomil team <subcommand>
| Subcommand | Description |
|---|---|
| add <email> | Invite a team member |
| list | List team members |
| set-policy | Set organization security policies |
| Option (add) | Default | Description |
|---|---|---|
| --role | member | Role: admin, member, viewer |
| Option (set-policy) | Default | Description |
|---|---|---|
| --require-mfa | off | Require MFA for all members |
| --session-hours | 24 | Session duration in hours |
octomil team add alice@acme.com --role admin
octomil team list
octomil team set-policy --require-mfa --session-hours 8
octomil keys
Manage API keys.
octomil keys <subcommand>
| Subcommand | Description |
|---|---|
| create <name> | Create a new API key |
| list | List API keys |
| revoke <key_id> | Revoke an API key |
| Option (create) | Default | Description |
|---|---|---|
| --scope | - | Permission scope (repeatable): devices:read, devices:write, models:read, models:write, training:read, training:write |
| --expires | - | Expiration (e.g. 30d, 90d) |
octomil keys create deploy-key --scope devices:write --scope models:read
octomil keys list
octomil keys revoke key_abc123
octomil train
Run federated training across deployed devices.
octomil train <subcommand>
| Subcommand | Description |
|---|---|
| start <model> | Start federated training |
| status <model> | Show training progress |
| stop <model> | Stop active training |
| Option (start) | Default | Description |
|---|---|---|
| --strategy | fedavg | Aggregation strategy: fedavg, fedprox, scaffold, krum, fedmedian, fedtrimmedavg, fedopt, fedadam, ditto |
| --rounds | 10 | Number of training rounds |
| --min-devices | 2 | Minimum devices per round |
| --group | - | Device group to train with |
octomil train start sentiment-v1 --strategy fedavg --rounds 50
octomil train start sentiment-v1 --strategy scaffold --group production
octomil train status sentiment-v1
octomil train stop sentiment-v1
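FedAvg, the default strategy, aggregates device model updates as an average weighted by each device's local sample count. A minimal sketch of that aggregation step (illustrative only, not octomil's implementation):

```python
# Minimal FedAvg aggregation sketch: weight vectors from each device,
# averaged in proportion to local sample counts. Illustrative only.
def fedavg(updates: list[tuple[list[float], int]]) -> list[float]:
    """Weighted average of device weight vectors.

    updates: list of (weights, n_samples) pairs, one per device.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [
        sum(w[i] * n for w, n in updates) / total
        for i in range(dim)
    ]

# Two devices: one trained on 100 samples, one on 300
agg = fedavg([([1.0, 2.0], 100), ([3.0, 4.0], 300)])
# -> [2.5, 3.5]  (pulled 3:1 toward the device with more data)
```

The other strategies (fedprox, scaffold, krum, ...) modify this step, e.g. by adding proximal terms or discarding outlier updates, but the round structure is the same.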
octomil federation
Manage cross-organization federations.
octomil federation <subcommand>
| Subcommand | Description |
|---|---|
| create <name> | Create a new federation |
| invite <name> <org_ids> | Invite organizations |
| join <name> | Join a federation |
| list | List federations |
| show <name> | Show federation details |
| members <name> | List federation members |
| share <model> <federation> | Share a model with a federation |
octomil federation create "healthcare-consortium"
octomil federation invite "healthcare-consortium" org_123 org_456
octomil federation share phi-mini "healthcare-consortium"
octomil integrations
Manage observability export integrations (metrics + logs).
octomil integrations <subcommand>
| Subcommand | Description |
|---|---|
| list | List all configured integrations |
| create | Create a metrics or log integration |
| delete <id> | Delete an integration |
| test <id> | Test an integration |
| connect-otlp | Connect an OTLP collector for both metrics and logs |
| Option (list) | Default | Description |
|---|---|---|
| --type | all | Filter: metrics, logs, all |
| --json | off | Output as JSON |
| Option (connect-otlp) | Default | Description |
|---|---|---|
| --endpoint | required | OTLP collector URL (e.g. http://collector:4318) |
| --name | OTLP Collector | Display name |
| --headers-json | - | Auth headers as JSON |
# List all integrations
octomil integrations list
# Connect OTLP collector (recommended — configures metrics + logs)
octomil integrations connect-otlp --endpoint http://otel-collector:4318
# With auth headers
octomil integrations connect-otlp --endpoint https://otlp.grafana.net \
--headers-json '{"Authorization": "Basic abc123"}'
# Create individual integrations
octomil integrations create --kind metrics --type prometheus --name prod-prom \
--config-json '{"prefix": "octomil"}'
octomil integrations create --kind logs --type splunk --name prod-splunk \
--endpoint https://splunk.example.com/services/collector --format hec
# Test and delete
octomil integrations test int_abc123 --kind metrics
octomil integrations delete int_abc123 --kind metrics
Environment variables
| Variable | Description |
|---|---|
| OCTOMIL_API_KEY | API key for Octomil Cloud |
| OCTOMIL_API_BASE | Override API base URL |
| OCTOMIL_DASHBOARD_URL | Dashboard URL for browser login (default: https://app.octomil.com) |
| OCTOMIL_MODEL | Default model for demo/serve |
Config files
| Path | Description |
|---|---|
| ~/.octomil/credentials | API key + org from octomil login (JSON) |
| ~/.octomil/config.json | Organization settings from octomil init |
| ~/.octomil/models/ | Downloaded model cache |