Device Profiling
Octomil maintains a database of real-world inference performance data across device types, collected anonymously from octomil benchmark runs when sharing is enabled (default behavior with an API key). This data powers deployment recommendations, routing decisions, and compatibility checks -- so when you ask "will this model run well on an iPhone 14?", the answer comes from measured data, not estimates.
The Data Flywheel
Every shared octomil benchmark run contributes anonymous performance data. More submissions produce better recommendations for everyone. No personally identifiable information is collected -- submissions include runtime/hardware characteristics and performance numbers only.
You run a benchmark
        |
        v
octomil benchmark ------> Anonymous data submitted*
        |                           |
        v                           v
Your local results        Aggregated into device profiles
                                    |
                                    v
                      Better recommendations for routing,
                      compatibility, and deployment decisions

* Upload occurs only when an API key is configured. Use --local to skip upload.
Running a Benchmark
octomil benchmark phi-4-mini
Output:
Model:      phi-4-mini (3.8B params)
Engine:     mlx
Platform:   darwin-arm64 (Apple M3 Pro)
RAM:        18 GB
Iterations: 10

Latency:
  avg: 55.2 ms
  p50: 54.8 ms
  p95: 58.1 ms
  p99: 61.3 ms

Throughput:
  avg:  38.7 tok/s
  peak: 42.1 tok/s

Token Timing:
  TTFT: 142.3 ms
  TPOT: 25.8 ms

Memory:
  peak: 4.2 GB
Sharing Results
octomil benchmark uploads anonymous metrics by default when an API key is present.
If no API key is configured, nothing is uploaded. Use --local to force local-only mode.
octomil benchmark phi-4-mini # default: share if logged in
octomil benchmark phi-4-mini --local
Shared fields: model name, backend/runtime, platform, CPU architecture, OS version, accelerator type, total RAM, iteration count, prompt/completion token counts, latency distribution (avg/min/max/p50/p90/p95/p99), TTFT, TPOT, throughput, and peak memory. Not shared: prompts, model outputs, model files/weights, IP addresses, device IDs, or user profile data.
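The allow-list above can be expressed as a quick sanity check. This is a hypothetical illustration of the sharing policy, not code from the octomil client; `build_submission` and the field names are stand-ins based on the lists in this section.

```python
# Hypothetical sketch of the shared-fields policy described above:
# only runtime/hardware characteristics and performance numbers are kept;
# prompts, outputs, and identifying data never enter the payload.

SHARED_FIELDS = {
    "model", "backend", "platform", "arch", "os_version", "ram_total_bytes",
    "iterations", "prompt_tokens", "completion_tokens",
    "avg_latency_ms", "p50_latency_ms", "p95_latency_ms", "p99_latency_ms",
    "ttft_ms", "tpot_ms", "avg_tokens_per_second", "peak_tokens_per_second",
    "peak_memory_bytes",
}

def build_submission(raw_results: dict) -> dict:
    """Keep only allow-listed fields from a local benchmark result."""
    return {k: v for k, v in raw_results.items() if k in SHARED_FIELDS}

local = {
    "model": "phi-4-mini",
    "prompt": "Explain quantization",  # never uploaded
    "device_id": "A1B2C3",             # never uploaded
    "avg_latency_ms": 55.2,
}
print(build_submission(local))
# {'model': 'phi-4-mini', 'avg_latency_ms': 55.2}
```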
Benchmark Options
| Option | Default | Description |
|---|---|---|
| `--local` | off | Keep results local (no upload) |
| `--iterations` | 10 | Number of benchmark iterations |
| `--max-tokens` | 50 | Max tokens per iteration |
| `--engine` | auto | Force a specific engine |
| `--all-engines` | off | Benchmark all available engines |
Device Profiles API
Aggregated performance data is available via the API.
GET /api/v1/benchmarks/leaderboard
Retrieve the performance leaderboard -- best throughput per model, platform, and backend combination.
- cURL
- Python
- JavaScript
curl "https://api.octomil.com/api/v1/benchmarks/leaderboard?model=phi-4-mini"
import requests
response = requests.get(
"https://api.octomil.com/api/v1/benchmarks/leaderboard",
params={"model": "phi-4-mini"},
)
print(response.json())
const response = await fetch(
"https://api.octomil.com/api/v1/benchmarks/leaderboard?model=phi-4-mini"
);
const data = await response.json();
console.log(data);
[
{
"model": "phi-4-mini",
"backend": "mlx",
"platform": "darwin",
"arch": "arm64",
"avg_tokens_per_second": 41.3,
"avg_latency_ms": 52.8,
"avg_ttft_ms": 138.4,
"avg_tpot_ms": 24.2,
"avg_p99_latency_ms": 59.7,
"submissions": 847
},
{
"model": "phi-4-mini",
"backend": "llamacpp",
"platform": "linux",
"arch": "x86_64",
"avg_tokens_per_second": 28.6,
"avg_latency_ms": 71.4,
"avg_ttft_ms": 195.2,
"avg_tpot_ms": 35.0,
"avg_p99_latency_ms": 88.1,
"submissions": 312
}
]
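Once you have the leaderboard JSON, choosing the best platform/backend combination for a model reduces to a `max()` over throughput. The sketch below reuses (a trimmed version of) the example response above.

```python
# Pick the highest-throughput leaderboard entry for a model.
# Entries mirror the example leaderboard response shown above.
leaderboard = [
    {"backend": "mlx", "platform": "darwin", "avg_tokens_per_second": 41.3},
    {"backend": "llamacpp", "platform": "linux", "avg_tokens_per_second": 28.6},
]

best = max(leaderboard, key=lambda e: e["avg_tokens_per_second"])
print(f"{best['backend']} on {best['platform']}: "
      f"{best['avg_tokens_per_second']} tok/s")
# mlx on darwin: 41.3 tok/s
```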
GET /api/v1/benchmarks
List individual benchmark submissions. Filter by model, platform, or backend.
- cURL
- Python
- JavaScript
curl "https://api.octomil.com/api/v1/benchmarks?model=phi-4-mini&platform=darwin&limit=10"
import requests
response = requests.get(
"https://api.octomil.com/api/v1/benchmarks",
params={"model": "phi-4-mini", "platform": "darwin", "limit": 10},
)
print(response.json())
const params = new URLSearchParams({
model: "phi-4-mini",
platform: "darwin",
limit: "10",
});
const response = await fetch(
`https://api.octomil.com/api/v1/benchmarks?${params}`
);
const data = await response.json();
console.log(data);
POST /api/v1/benchmarks
Submit benchmark results programmatically (used by CLI sharing when benchmarks are uploaded).
- cURL
- Python
- JavaScript
curl -X POST https://api.octomil.com/api/v1/benchmarks \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"model_size_bytes": 2147483648,
"quantization": "q4_k_m",
"backend": "mlx",
"platform": "darwin",
"arch": "arm64",
"os_version": "15.3",
"ram_total_bytes": 18253611008,
"iterations": 10,
"prompt_tokens": 32,
"completion_tokens": 128,
"avg_latency_ms": 55.2,
"p50_latency_ms": 54.8,
"p95_latency_ms": 58.1,
"p99_latency_ms": 61.3,
"ttft_ms": 142.3,
"tpot_ms": 25.8,
"avg_tokens_per_second": 38.7,
"peak_tokens_per_second": 42.1,
"peak_memory_bytes": 4509715456,
"source": "manual"
}'
import requests
response = requests.post(
"https://api.octomil.com/api/v1/benchmarks",
headers={"Authorization": "Bearer <token>"},
json={
"model": "phi-4-mini",
"model_size_bytes": 2147483648,
"quantization": "q4_k_m",
"backend": "mlx",
"platform": "darwin",
"arch": "arm64",
"os_version": "15.3",
"ram_total_bytes": 18253611008,
"iterations": 10,
"prompt_tokens": 32,
"completion_tokens": 128,
"avg_latency_ms": 55.2,
"p50_latency_ms": 54.8,
"p95_latency_ms": 58.1,
"p99_latency_ms": 61.3,
"ttft_ms": 142.3,
"tpot_ms": 25.8,
"avg_tokens_per_second": 38.7,
"peak_tokens_per_second": 42.1,
"peak_memory_bytes": 4509715456,
"source": "manual",
},
)
print(response.json())
const response = await fetch("https://api.octomil.com/api/v1/benchmarks", {
method: "POST",
headers: {
"Authorization": "Bearer <token>",
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "phi-4-mini",
model_size_bytes: 2147483648,
quantization: "q4_k_m",
backend: "mlx",
platform: "darwin",
arch: "arm64",
os_version: "15.3",
ram_total_bytes: 18253611008,
iterations: 10,
prompt_tokens: 32,
completion_tokens: 128,
avg_latency_ms: 55.2,
p50_latency_ms: 54.8,
p95_latency_ms: 58.1,
p99_latency_ms: 61.3,
ttft_ms: 142.3,
tpot_ms: 25.8,
avg_tokens_per_second: 38.7,
peak_tokens_per_second: 42.1,
peak_memory_bytes: 4509715456,
source: "manual",
}),
});
const data = await response.json();
console.log(data);
How Profiles Are Used
Device profiling data feeds into multiple Octomil systems:
Deployment Orchestration
When you deploy a model to your device fleet, Octomil checks measured performance data for each device class. If real benchmark data exists for a device/model combination, deployment plans use data_source: "measured" instead of "estimated":
{
"device_id": "device-abc",
"format": "coreml",
"executor": "ane",
"quantization": "int4",
"runtime_config": {
"data_source": "measured",
"expected_throughput_tok_s": 38.7,
"expected_latency_ms": 55.2
}
}
Measured data produces more accurate deployment decisions and reduces the risk of deploying models to devices that can't handle them well.
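The measured-vs-estimated switch can be sketched as a simple lookup. This is an illustrative sketch, not Octomil's internal implementation; the `profiles` mapping and the bare `"estimated"` fallback are assumptions.

```python
def plan_runtime_config(device_class: str, model: str, profiles: dict) -> dict:
    """Build a runtime_config, preferring measured fleet data when available.

    `profiles` maps (device_class, model) -> aggregated benchmark numbers.
    This helper is a hypothetical stand-in for Octomil's internal aggregation.
    """
    measured = profiles.get((device_class, model))
    if measured is not None:
        return {
            "data_source": "measured",
            "expected_throughput_tok_s": measured["avg_tokens_per_second"],
            "expected_latency_ms": measured["avg_latency_ms"],
        }
    # No benchmark data for this combination: fall back to an estimate.
    return {"data_source": "estimated"}

profiles = {
    ("m3-pro", "phi-4-mini"): {"avg_tokens_per_second": 38.7,
                               "avg_latency_ms": 55.2},
}
print(plan_runtime_config("m3-pro", "phi-4-mini", profiles)["data_source"])
# measured
print(plan_runtime_config("pixel-8", "phi-4-mini", profiles)["data_source"])
# estimated
```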
Routing Decisions
The model routing engine uses profiling data to estimate on-device latency and determine whether a model should run locally or fall back to cloud inference.
Move-to-Device Recommendations
The recommendation engine uses fleet-wide benchmark data to assess device compatibility when recommending models for on-device deployment.
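At its core, that compatibility check compares fleet-measured peak memory against a device's RAM. The sketch below is illustrative only; the 0.7 headroom factor is an assumption, not Octomil's actual threshold.

```python
def is_compatible(device_ram_bytes: int, peak_memory_bytes: int,
                  headroom: float = 0.7) -> bool:
    """Assume a model fits if measured peak memory stays under a fraction
    of total RAM. The 0.7 headroom factor is an illustrative assumption."""
    return peak_memory_bytes <= device_ram_bytes * headroom

# 4.2 GB measured peak (from the benchmark output above)
# on an 18 GB device vs. a 4 GB device:
print(is_compatible(18 * 1024**3, 4509715456))  # True
print(is_compatible(4 * 1024**3, 4509715456))   # False
```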
Dashboard
The benchmark leaderboard is visible in the Octomil dashboard. The leaderboard shows:
- Performance rankings by model, grouped by platform and backend
- Throughput and latency distributions across submissions
- Device coverage -- how many unique device types have contributed data for each model
- Trends over time as new hardware and runtime versions are tested
Gotchas
- First benchmark downloads the model -- if the model isn't cached locally, the first run includes download time. Run `octomil pull <model>` first for accurate timing.
- Background processes affect results -- close other applications before benchmarking. Background CPU/GPU load skews latency and throughput numbers.
- Thermal throttling on mobile -- running multiple benchmark iterations on a phone can trigger thermal throttling, producing progressively slower results. Use `--iterations 5` on mobile.
- Shared benchmark submissions are anonymous but permanent -- uploaded benchmark records cannot currently be retracted. Use `--local` for private runs or while testing non-representative configurations.
- Leaderboard averages across submissions -- a single outlier submission doesn't dominate. The leaderboard shows the mean across all submissions for a given model/platform/backend tuple.
- Benchmark ≠ production performance -- benchmarks use synthetic prompts with fixed token counts. Real-world performance varies with prompt length, concurrent requests, and memory pressure.
Related
- Local Inference — run models locally
- Model Routing — routing decisions powered by profiling data
- Device Targeting — deployment recommendations using fleet data
- Speculative Decoding — benchmark speculative vs standard performance
- Observability — inference metrics reported to dashboard