CLI Reference
Installation
- Binary: curl -fsSL https://octomil.com/install.sh | sh (downloads a standalone binary; no Python required)
- Homebrew: brew install octomil/tap/octomil
- Windows (PowerShell): irm https://octomil.com/install.ps1 | iex
Verify:
octomil --version
Commands
octomil serve
Start a local OpenAI-compatible inference server.
octomil serve <model> [options]
| Option | Default | Description |
|---|---|---|
| --port, -p | 8080 | Port to listen on |
| --host | 0.0.0.0 | Host to bind to |
| --engine, -e | auto | Force engine (mlx-lm, llama.cpp, mnn, mlc-llm, onnxruntime) |
| --benchmark | off | Run latency benchmark on startup |
| --share | off | Share anonymous benchmark data with Octomil Cloud |
| --json-mode | off | Default all responses to JSON output |
| --cache-size | 2048 | KV cache size in MB |
| --no-cache | off | Disable KV cache |
| --max-queue | 32 | Max pending requests in queue (0 to disable) |
| --models | - | Comma-separated models for multi-model serving |
| --auto-route | off | Enable automatic query routing (requires --models) |
| --route-strategy | complexity | Routing strategy for --auto-route |
# Basic usage
octomil serve phi-mini
# Model with specific quantization (model:variant syntax)
octomil serve gemma-3b:4bit
octomil serve gemma-3b:8bit
# Specific engine + port
octomil serve gemma-1b --engine llama.cpp --port 8080
# Multi-model with routing
octomil serve smollm-360m --models smollm-360m,phi-mini,llama-3b --auto-route
# JSON mode
octomil serve gemma-1b --json-mode
# Speech-to-text (Whisper)
octomil serve whisper-base
octomil serve whisper-large-v3
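The complexity routing strategy used by --auto-route is not specified here; the sketch below is a hypothetical illustration of the idea only (not octomil's implementation): short, simple prompts go to the smallest model in --models, longer or reasoning-heavy prompts to a larger one.

```python
# Hypothetical sketch of a "complexity" routing strategy.
# NOT octomil's actual heuristic -- purely illustrative.
def route_by_complexity(prompt: str, models: list[str]) -> str:
    """Pick a model tier based on a crude prompt-complexity score."""
    score = len(prompt.split())          # longer prompts score higher
    if any(kw in prompt.lower() for kw in ("code", "prove", "analyze")):
        score += 50                      # reasoning-ish keywords bump the score
    if score < 20:
        return models[0]                 # smallest model, e.g. smollm-360m
    if score < 60:
        return models[len(models) // 2]  # mid-tier
    return models[-1]                    # largest model

models = ["smollm-360m", "phi-mini", "llama-3b"]
print(route_by_complexity("Hello", models))  # -> smollm-360m
```

Whatever the real heuristic is, the shape is the same: score the query, map the score to a tier of the models you passed in.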
API usage
Once the server is running, use any OpenAI-compatible client:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-mini",
"messages": [{"role": "user", "content": "Hello"}]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="phi-mini",
messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8080/v1",
apiKey: "not-needed",
});
const response = await client.chat.completions.create({
model: "phi-mini",
messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
octomil benchmark
Run inference benchmarks on a model.
octomil benchmark <model> [options]
| Option | Default | Description |
|---|---|---|
| --local | off | Keep results local (don't upload) |
| --iterations, -n | 10 | Number of inference iterations |
| --max-tokens | 50 | Max tokens per iteration |
| --engine, -e | auto | Force a specific engine |
| --all-engines | off | Benchmark all available engines |
octomil benchmark gemma-1b --all-engines
octomil benchmark phi-mini --iterations 20 --local
By default, each benchmark run contributes anonymous performance data to the Octomil community leaderboard, which improves engine rankings and routing decisions for all users.
Never shared: prompts, model outputs, model files/weights, IP address, device IDs, or user profile data.
Shared (anonymous hardware telemetry): model name, backend/runtime, platform + architecture, OS version, accelerator type, total RAM, iteration count, latency stats (avg/min/max/p50/p90/p95/p99), TTFT, TPOT, throughput, and peak memory.
Opt out anytime: pass --local to keep results on your machine, or disable telemetry globally in your dashboard settings. No data is uploaded without an active API key (octomil login or OCTOMIL_API_KEY).
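The latency statistics in the shared telemetry (avg/min/max, p50/p90/p95/p99, throughput) are standard aggregates over per-iteration latencies. As a rough illustration of how such numbers are derived (not octomil's actual code):

```python
# Illustrative computation of benchmark aggregates from raw per-iteration
# latencies. Standard percentile math, not octomil's implementation.
def latency_stats(latencies_ms: list[float], tokens_per_iter: int) -> dict:
    xs = sorted(latencies_ms)

    def pct(p: float) -> float:
        # nearest-rank percentile over the sorted sample
        k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
        return xs[k]

    avg = sum(xs) / len(xs)
    return {
        "avg": avg, "min": xs[0], "max": xs[-1],
        "p50": pct(50), "p90": pct(90), "p95": pct(95), "p99": pct(99),
        # throughput: tokens generated per second at the average latency
        "throughput_tok_s": tokens_per_iter / (avg / 1000),
    }

stats = latency_stats([100, 120, 110, 300, 115], tokens_per_iter=50)
```

Note how a single slow iteration (300 ms here) dominates the tail percentiles while barely moving p50, which is why the telemetry reports both.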
octomil deploy
Deploy a model to devices.
octomil deploy <name> [options]
| Option | Default | Description |
|---|---|---|
| --version, -v | latest | Version to deploy |
| --phone | off | Deploy to connected phone |
| --rollout, -r | 100 | Rollout percentage (1-100) |
| --strategy, -s | canary | Strategy: canary, immediate, blue_green |
| --target, -t | - | Target formats: ios, android |
| --devices | - | Comma-separated device IDs |
| --group, -g | - | Device group name |
| --dry-run | off | Preview without deploying |
# Deploy to phone (app-install QR -> pairing QR -> multi-device select)
octomil deploy phi-4-mini --phone
# Canary rollout to 10%
octomil deploy sentiment-v1 --rollout 10 --strategy canary
# Target specific devices
octomil deploy gemma-1b --devices device_1,device_2
# Dry run
octomil deploy gemma-1b --group production --dry-run
octomil login
Authenticate with Octomil Cloud.
octomil login [options]
| Option | Default | Description |
|---|---|---|
| --api-key | - | Paste API key directly (skip browser) |
# Browser-based (default)
octomil login
# Direct API key (CI/headless)
octomil login --api-key edg_...
Or set the environment variable:
export OCTOMIL_API_KEY=edg_...
octomil push
Upload model artifacts to the registry. Auto-downloads and converts if the model isn't local.
octomil push [path] --model-id <id> --version <version>
| Option | Default | Description |
|---|---|---|
| path (positional) | - | Path to artifacts or model name (phi-4-mini, hf:org/model, ollama:name) |
| --model-id, -m | inferred | Model ID in the registry |
| --version, -v | required | Semantic version (e.g. 1.0.0) |
| --quantize, -q | - | Quantize models before pushing: auto, int8, int4, dynamic, float16 |
| --quality-threshold | - | Reject quantized models if quality drops below this value (0.0-1.0) |
# Push local artifacts
octomil push ./converted --model-id sentiment-v1 --version 1.0.0
# Auto-download, convert, and push (no local files needed)
octomil push phi-4-mini --version 1.0.0
# Explicit source
octomil push hf:microsoft/Phi-4-mini --version 1.0.0
octomil push ollama:phi4-mini --version 1.0.0
# Push with quantization
octomil push ./converted --model-id sentiment-v1 --version 2.0.0 --quantize int8
# Quantize with quality gate
octomil push phi-4-mini --version 1.0.0 --quantize auto --quality-threshold 0.95
octomil pull
Download a model from the registry.
octomil pull <name> [options]
| Option | Default | Description |
|---|---|---|
| --version, -v | latest | Version to download |
| --format | - | Model format (onnx, coreml, tflite) |
| --output, -o | . | Output directory |
octomil pull sentiment-v1 --version 1.0.0 --format coreml
octomil convert
Convert a model to edge formats locally. This is the primary conversion path: all conversion runs on your machine, with no server round-trip required.
octomil convert <model_path> [options]
| Option | Default | Description |
|---|---|---|
| --formats, -f | onnx | Target formats: onnx, coreml, tflite (comma-separated) |
| --output, -o | ./converted | Output directory |
| --input-shape | 1,3,224,224 | Input tensor shape |
| --push | off | Upload converted artifacts to the registry after conversion |
| --validate / --no-validate | --validate | Run validation checks on converted artifacts |
# Convert locally
octomil convert model.pt --formats onnx,coreml,tflite --output converted_models
# Convert and push to registry in one step
octomil convert model.pt --formats onnx,coreml,tflite --push --model-id sentiment-v1 --version 1.0.0
# Skip validation (faster, for CI pipelines)
octomil convert model.pt --formats onnx,coreml --no-validate
octomil eval
Run quality evaluation comparing cloud vs on-device inference. Sends test inputs to the server's eval endpoint and reports whether the model meets a quality threshold.
octomil eval <model_id> --test-data <path> [options]
| Option | Default | Description |
|---|---|---|
| model_id (positional) | required | Model ID to evaluate |
| --test-data, -d | required | Path to JSONL file with test inputs |
| --threshold, -t | 0.95 | Quality threshold (0.0-1.0) |
| --api-base | http://localhost:8000 | Server API base URL (also reads OCTOMIL_API_BASE) |
| --metrics, -m | similarity,exact_match,latency | Comma-separated metrics to compute |
The test data file is JSONL format. Each line is a JSON object with at least an "input" key and optionally an "expected_output" key:
{"input": "Great product, fast shipping", "expected_output": "positive"}
{"input": "Terrible experience, would not buy again", "expected_output": "negative"}
{"input": "It was okay", "expected_output": "neutral"}
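You can build and sanity-check the test file programmatically; a generic JSONL sketch, not tied to any octomil API:

```python
import json

# Write eval cases to tests.jsonl: one JSON object per line,
# each with an "input" key and an optional "expected_output" key.
cases = [
    {"input": "Great product, fast shipping", "expected_output": "positive"},
    {"input": "It was okay", "expected_output": "neutral"},
]
with open("tests.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Validate before running `octomil eval`: every line must parse
# as JSON and carry an "input" key.
with open("tests.jsonl") as f:
    rows = [json.loads(line) for line in f]
assert all("input" in row for row in rows)
```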
# Basic quality eval
octomil eval sentiment-v1 --test-data tests.jsonl
# Custom threshold and metrics
octomil eval sentiment-v1 --test-data tests.jsonl --threshold 0.90 --metrics similarity,latency
# Against a remote server
octomil eval sentiment-v1 --test-data tests.jsonl --api-base https://api.octomil.com
The command exits with code 1 if the quality threshold is not met, making it suitable for CI pipelines. Output includes overall score, per-metric breakdowns, and statistical significance (p-value, effect size) when available.
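The exact metric definitions are not specified here; a plausible reading of exact_match and similarity (illustrative only, using stdlib difflib rather than whatever octomil actually computes) looks like this:

```python
from difflib import SequenceMatcher

def exact_match(pred: str, expected: str) -> float:
    # 1.0 if outputs match after trimming whitespace, else 0.0
    return 1.0 if pred.strip() == expected.strip() else 0.0

def similarity(pred: str, expected: str) -> float:
    # character-level ratio in [0, 1]; the real metric may be
    # embedding-based -- this is purely illustrative
    return SequenceMatcher(None, pred, expected).ratio()

preds = ["positive", "negative", "positive"]
expected = ["positive", "negative", "neutral"]
em = sum(exact_match(p, e) for p, e in zip(preds, expected)) / len(preds)
print(f"exact_match: {em:.2f}")  # 2 of 3 match -> 0.67
```

An aggregate score like this is what gets compared against --threshold to decide the exit code.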
octomil quantize
Quantize a model for edge deployment without pushing to the registry. Supports ONNX, TFLite, CoreML, and GGUF formats.
octomil quantize <model_path> [options]
| Option | Default | Description |
|---|---|---|
| model_path (positional) | required | Path to model file (.onnx, .tflite, .mlpackage, .mlmodel, .gguf) or directory |
| --method, -m | auto | Quantization method: auto, int8, int4, dynamic, float16 |
| --output, -o | <model_dir>/quantized | Output directory for quantized models |
| --quality-threshold | - | Reject quantized models if quality score drops below this value (0.0-1.0) |
# Auto-select best quantization method
octomil quantize model.onnx
# Specific method
octomil quantize model.onnx --method int8
# Custom output directory
octomil quantize model.onnx --method int4 --output ./optimized
# With quality gate — rejects if quality drops too far
octomil quantize model.onnx --method auto --quality-threshold 0.95
# Quantize all models in a directory
octomil quantize ./models/ --method float16 --output ./quantized
Output reports size reduction, compression ratio, and quality scores (when --quality-threshold is set) for each processed file.
octomil check
Check device compatibility for a local model file.
octomil check <model_path> [options]
| Option | Default | Description |
|---|---|---|
| --devices, -d | - | Device profiles (e.g. iphone_15_pro,pixel_8) |
octomil check model.onnx --devices iphone_15_pro,pixel_8
octomil list models
List available models with their variants and supported engines.
octomil list models
Output shows all available models, quantization variants, and which engines support each:
Model Variants Engines
gemma-1b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
gemma-4b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
phi-4-mini 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp
llama-3.2-1b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
llama-3.2-3b 4bit, 8bit mlx, mnn, mlc-llm, llama.cpp, onnxruntime
whisper-tiny fp16 whisper.cpp
whisper-base fp16 whisper.cpp
whisper-small fp16 whisper.cpp
whisper-medium fp16 whisper.cpp
whisper-large-v3 fp16 whisper.cpp
...
octomil scan
Scan the local network for Octomil inference servers and devices.
octomil scan [options]
| Option | Default | Description |
|---|---|---|
| --timeout | 5 | Scan timeout in seconds |
octomil scan
# Found 2 Octomil instances:
# 192.168.1.42:8000 — phi-4-mini on mlx (58 tok/s)
# 192.168.1.100:8000 — gemma-1b on llama.cpp (34 tok/s)
octomil status
Show deployment status for a model.
octomil status <name>
octomil dashboard
Open the Octomil dashboard in your browser.
octomil dashboard
octomil init
Initialize an Octomil organization for enterprise use.
octomil init <org_name> [options]
| Option | Default | Description |
|---|---|---|
| --compliance | - | Compliance preset: hipaa, gdpr, pci, soc2 |
| --region | us | Data region: us, eu, ap |
| --api-base | - | Override API base URL |
octomil init "Acme Corp" --compliance hipaa --region us
octomil org
Show current organization info and settings.
octomil org
octomil demo code-assistant
Interactive code assistant powered by a local LLM.
octomil demo code-assistant [options]
| Option | Default | Description |
|---|---|---|
| --model, -m | auto | Model to serve |
| --url | - | Connect to existing server |
| --port, -p | 8099 | Port for auto-started server |
| --no-auto-start | off | Don't auto-start server |
octomil demo code-assistant
octomil demo code-assistant --model phi-mini
octomil launch
Launch a coding agent powered by a local model. Starts octomil serve in the background (if not already running) and configures the agent to use the local endpoint.
octomil launch <agent> [options]
| Argument | Description |
|---|---|
| claude | Launch Claude Code with local backend |
| codex | Launch OpenAI Codex CLI |
| openclaw | Launch OpenClaw agent |
| aider | Launch Aider coding assistant |
| Option | Default | Description |
|---|---|---|
| --model, -m | qwen3 | Model to serve |
| --port, -p | 8080 | Port for local server |
octomil launch claude
octomil launch aider --model deepseek-coder-v2
octomil launch codex --model codestral
octomil models
List available models from Ollama and the Octomil registry.
octomil models [options]
| Option | Default | Description |
|---|---|---|
| --source | all | Filter source: all, ollama, registry |
octomil models
octomil models --source ollama
octomil rollback
Roll back a model to a previous version.
octomil rollback <name> [options]
| Option | Default | Description |
|---|---|---|
| --to-version | previous | Version to rollback to |
# Rollback to the previous version
octomil rollback sentiment-v1
# Rollback to a specific version
octomil rollback sentiment-v1 --to-version 1.0.0
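Resolving the default "previous" version is just semantic-version ordering, which the registry handles server-side; a hypothetical sketch of that selection:

```python
# Hypothetical illustration of "previous version" selection under
# semantic-version ordering. Not octomil's registry code.
def previous_version(versions: list[str], current: str) -> str:
    """Return the version immediately preceding `current` in semver order."""
    # numeric tuple compare: "1.10.0" correctly sorts after "1.2.0",
    # which plain string comparison would get wrong
    key = lambda v: tuple(int(x) for x in v.split("."))
    ordered = sorted(versions, key=key)
    idx = ordered.index(current)
    if idx == 0:
        raise ValueError("no earlier version to roll back to")
    return ordered[idx - 1]

previous_version(["1.0.0", "1.2.0", "1.10.0"], "1.10.0")  # -> "1.2.0"
```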
octomil pair
Connect to a pairing session as a device. Enter the code displayed by octomil deploy --phone to receive the model deployment.
octomil pair <code> [options]
| Option | Default | Description |
|---|---|---|
| --device-id | auto | Device identifier |
| --platform, -p | auto | Device platform: ios, android, python |
| --device-name | - | Friendly device name |
octomil pair ABC123
octomil pair ABC123 --device-name "Test iPhone"
octomil team
Manage organization team members.
octomil team <subcommand>
| Subcommand | Description |
|---|---|
| add <email> | Invite a team member |
| list | List team members |
| set-policy | Set organization security policies |
| Option (add) | Default | Description |
|---|---|---|
| --role | member | Role: admin, member, viewer |
| Option (set-policy) | Default | Description |
|---|---|---|
| --require-mfa | off | Require MFA for all members |
| --session-hours | 24 | Session duration in hours |
octomil team add alice@acme.com --role admin
octomil team list
octomil team set-policy --require-mfa --session-hours 8
octomil keys
Manage API keys.
octomil keys <subcommand>
| Subcommand | Description |
|---|---|
| create <name> | Create a new API key |
| list | List API keys |
| revoke <key_id> | Revoke an API key |
| Option (create) | Default | Description |
|---|---|---|
| --scope | - | Permission scope (repeatable): devices:read, devices:write, models:read, models:write, training:read, training:write |
| --expires | - | Expiration (e.g. 30d, 90d) |
octomil keys create deploy-key --scope devices:write --scope models:read
octomil keys list
octomil keys revoke key_abc123
octomil train
Run federated training across deployed devices.
octomil train <subcommand>
| Subcommand | Description |
|---|---|
| start <model> | Start federated training |
| status <model> | Show training progress |
| stop <model> | Stop active training |
| Option (start) | Default | Description |
|---|---|---|
| --strategy | fedavg | Aggregation strategy: fedavg, fedprox, scaffold, krum, fedmedian, fedtrimmedavg, fedopt, fedadam, ditto |
| --rounds | 10 | Number of training rounds |
| --min-devices | 2 | Minimum devices per round |
| --group | - | Device group to train with |
octomil train start sentiment-v1 --strategy fedavg --rounds 50
octomil train start sentiment-v1 --strategy scaffold --group production
octomil train status sentiment-v1
octomil train stop sentiment-v1
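FedAvg, the default strategy, aggregates device model updates as an average weighted by each device's local sample count. A minimal sketch of that aggregation step (illustrative only, not octomil's implementation):

```python
# Minimal FedAvg aggregation sketch: weight vectors from each device,
# averaged in proportion to local sample counts. Illustrative only.
def fedavg(updates: list[tuple[list[float], int]]) -> list[float]:
    """Weighted average of device weight vectors.

    updates: list of (weights, n_samples) pairs, one per device.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [
        sum(w[i] * n for w, n in updates) / total
        for i in range(dim)
    ]

# Two devices: one trained on 100 samples, one on 300
agg = fedavg([([1.0, 2.0], 100), ([3.0, 4.0], 300)])
# -> [2.5, 3.5]  (pulled 3:1 toward the device with more data)
```

The other strategies (fedprox, scaffold, krum, ...) modify this step, e.g. by adding proximal terms or discarding outlier updates, but the round structure is the same.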
octomil federation
Manage cross-organization federations.
octomil federation <subcommand>
| Subcommand | Description |
|---|---|
| create <name> | Create a new federation |
| invite <name> <org_ids> | Invite organizations |
| join <name> | Join a federation |
| list | List federations |
| show <name> | Show federation details |
| members <name> | List federation members |
| share <model> <federation> | Share a model with a federation |
octomil federation create "healthcare-consortium"
octomil federation invite "healthcare-consortium" org_123 org_456
octomil federation share phi-mini "healthcare-consortium"
octomil integrations
Manage observability export integrations (metrics + logs).
octomil integrations <subcommand>
| Subcommand | Description |
|---|---|
| list | List all configured integrations |
| create | Create a metrics or log integration |
| delete <id> | Delete an integration |
| test <id> | Test an integration |
| connect-otlp | Connect an OTLP collector for both metrics and logs |
| Option (list) | Default | Description |
|---|---|---|
| --type | all | Filter: metrics, logs, all |
| --json | off | Output as JSON |
| Option (connect-otlp) | Default | Description |
|---|---|---|
| --endpoint | required | OTLP collector URL (e.g. http://collector:4318) |
| --name | OTLP Collector | Display name |
| --headers-json | - | Auth headers as JSON |
# List all integrations
octomil integrations list
# Connect OTLP collector (recommended — configures metrics + logs)
octomil integrations connect-otlp --endpoint http://otel-collector:4318
# With auth headers
octomil integrations connect-otlp --endpoint https://otlp.grafana.net \
--headers-json '{"Authorization": "Basic abc123"}'
# Create individual integrations
octomil integrations create --kind metrics --type prometheus --name prod-prom \
--config-json '{"prefix": "octomil"}'
octomil integrations create --kind logs --type splunk --name prod-splunk \
--endpoint https://splunk.example.com/services/collector --format hec
# Test and delete
octomil integrations test int_abc123 --kind metrics
octomil integrations delete int_abc123 --kind metrics
Environment variables
| Variable | Description |
|---|---|
| OCTOMIL_API_KEY | API key for Octomil Cloud |
| OCTOMIL_API_BASE | Override API base URL |
| OCTOMIL_DASHBOARD_URL | Dashboard URL for browser login (default: https://app.octomil.com) |
| OCTOMIL_MODEL | Default model for demo/serve |
Config files
| Path | Description |
|---|---|
| ~/.octomil/credentials | API key + org from octomil login (JSON) |
| ~/.octomil/config.json | Organization settings from octomil init |
| ~/.octomil/models/ | Downloaded model cache |