Quickstart
Get up and running with on-device inference in 5 minutes.
Four steps. Two shell commands. A few lines of code.
1. Install the CLI ~30s
- macOS / Linux:
  curl -fsSL https://octomil.com/install.sh | sh
- Homebrew:
  brew install octomil/tap/octomil
- Windows (PowerShell):
  irm https://octomil.com/install.ps1 | iex
2. Sign in ~30s
octomil login
Your browser opens. Sign in with Google or create a passkey — the CLI receives your credentials automatically.
See your model on a phone before writing any code. Run octomil deploy phi-4-mini --phone — a QR code appears, scan it, and the model runs on your device.
In headless environments (CI, SSH, containers), octomil login falls back to a manual API key prompt. You can also pass --api-key directly or set OCTOMIL_API_KEY.
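For scripted use, a minimal sketch of both options (the key value is a placeholder, and passing --api-key to octomil login is an assumption based on the note above):

```shell
# Option 1: pass the key to login directly
octomil login --api-key edg_xxxxxxxxxxxx

# Option 2: export it once; subsequent commands pick it up
export OCTOMIL_API_KEY=edg_xxxxxxxxxxxx
octomil push phi-4-mini
```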
3. Push a model ~1 min
octomil push phi-4-mini
The CLI imports phi-4-mini from HuggingFace into your Octomil registry. The server handles conversion to edge formats (CoreML, TFLite). After import, it prints a ready-to-paste SDK snippet with your real apiKey and orgId.
Already have a local model file? Upload it directly:
octomil push ./model.safetensors --model-id phi-4-mini
Add --quantize to automatically optimize your model for on-device performance (e.g., octomil push phi-4-mini --quantize). Octomil selects the best quantization strategy based on target hardware. See Model Optimizer for details.
Pin a specific source with hf:org/model. Version defaults to 1.0.0 — override with --version 2.0.0.
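Put together, a pinned push might look like this sketch (hf:org/model stands in for a real HuggingFace path):

```shell
octomil push hf:org/model --version 2.0.0
```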
Smart Routing — Once pushed, Octomil automatically routes inference between device and cloud based on model capabilities, input complexity, and device resources. Common queries run on-device; hard queries fall back to the cloud with no code changes. See Smart Routing for configuration options.
4. Add the SDK to your app ~2 min
One dependency. No config files.
iOS (Swift)
Add the package URL in Xcode:
https://github.com/octomil-ai/octomil-ios.git
Or in Package.swift:
dependencies: [
    .package(url: "https://github.com/octomil-ai/octomil-ios.git", from: "1.0.0")
]
Requires iOS 15+.

Android (Kotlin)
In build.gradle.kts:
dependencies {
    implementation("ai.octomil:octomil-android:1.0.0")
    implementation("com.google.ai.edge.litert:litert:2.0.0")
}
Requires API 26+ (Android 8.0).

Python
pip install octomil
Requires Python 3.9+.

Node.js
pnpm install @octomil/sdk
Requires Node.js 18+.
Initialize the client, download the model, and run inference:
iOS (Swift)
import Octomil

let client = OctomilClient(apiKey: "edg_...", orgId: "your-org-id")
try await client.register()
let model = try await client.downloadModel(modelId: "phi-4-mini")
let result = try model.predict(input: ["features": inputData])

Android (Kotlin)
import ai.octomil.OctomilClient

val client = OctomilClient(apiKey = "edg_...", orgId = "your-org-id", context = this)
client.register()
val model = client.downloadModel(modelId = "phi-4-mini")
val result = model.predict(mapOf("features" to inputData))

Python
import octomil

client = octomil.Client(api_key="edg_...", org_id="your-org-id")
text = client.predict("phi-4-mini", [{"role": "user", "content": "Hello"}])
print(text)

Node.js
import { OctomilClient } from "@octomil/sdk";

const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });
const result = await client.predict("phi-4-mini", { text: "Hello" });
console.log(result.label, result.scores);
The model is cached on-device, so subsequent launches are instant. Inference runs entirely on the device: no network latency, no per-request cost, and full offline support.
Streaming Inference — Building an LLM-powered app? Use model.stream() instead of model.predict() to receive tokens as they are generated, enabling real-time chat UIs and progressive output. See Streaming Inference for usage and examples.
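A minimal Python sketch of the streaming pattern, assuming the Python client exposes a stream() counterpart to predict() (the method name and token-by-token iteration are assumptions, not confirmed API):

```python
import octomil

client = octomil.Client(api_key="edg_...", org_id="your-org-id")

# Iterate over tokens as they are generated instead of waiting for the full reply
for token in client.stream("phi-4-mini", [{"role": "user", "content": "Hello"}]):
    print(token, end="", flush=True)
```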
Embeddings — Need vector representations? Call model.embed(input) alongside predict() to generate embeddings for search, RAG, or similarity features. See Embeddings for supported models and dimensions.
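As a sketch, assuming an embed() counterpart on the Python client (the signature and return shape are assumptions):

```python
import octomil

client = octomil.Client(api_key="edg_...", org_id="your-org-id")

# Embed two strings for use in search or similarity features
docs = ["on-device inference", "edge AI deployment"]
vectors = [client.embed("phi-4-mini", d) for d in docs]
```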
Your model is live
Devices download it on first launch and run inference fully offline. Here's what to do next.
Monitor your fleet
octomil dashboard
Opens app.octomil.com — device status, inference latency, error rates, and resource usage in real time.
- Monitoring dashboard — metrics, alerts, fleet health
Evaluate model quality
Verify that your on-device model matches cloud quality before rolling out widely:
octomil eval phi-4-mini --dataset my-eval-set
Runs your evaluation suite against the deployed model and reports accuracy, latency, and regression metrics. See Quality Evaluation for configuring eval datasets and thresholds.
Deploy to more devices
Roll your model out to a device group or your entire fleet:
octomil deploy phi-4-mini --group production --rollout 25
Start with 25% canary, watch metrics, then ramp to 100%. Roll back instantly if anything looks off.
- Rollouts — canary, blue-green, and immediate strategies
- iOS SDK — background updates, pairing deep links, CoreML options
- Android SDK — NNAPI acceleration, WorkManager scheduling
Train better models
Improve your model using on-device data — without it ever leaving the device:
octomil train phi-4-mini --group production --rounds 10
Federated learning aggregates gradient updates from your fleet. The model improves every round while user data stays private.
- Federated training — aggregation strategies, privacy, convergence
- Personalization — per-device fine-tuning with Ditto (Python SDK)
Test what works
Compare model versions with A/B experiments across your fleet:
octomil experiment create --model-a phi-4-mini:v1 --model-b phi-4-mini:v2 --traffic 50
- Experiments — traffic splits, metrics, significance testing
Ship it
Once you have a winner, promote it to your entire fleet:
octomil deploy phi-4-mini --version 2.0.0 --strategy immediate
- CLI Reference — every command and flag
- Supported Models — what runs, on what hardware