
Device Profiling

Octomil maintains a database of real-world inference performance data across device types, collected anonymously from octomil benchmark runs when sharing is enabled (default behavior with an API key). This data powers deployment recommendations, routing decisions, and compatibility checks -- so when you ask "will this model run well on an iPhone 14?", the answer comes from measured data, not estimates.

The Data Flywheel

Every shared octomil benchmark run contributes anonymous performance data. More submissions produce better recommendations for everyone. No personally identifiable information is collected -- submissions include runtime/hardware characteristics and performance numbers only.

You run a benchmark
        |
        v
octomil benchmark -----> Anonymous data submitted*
        |                            |
        v                            v
Your local results       Aggregated into device profiles
                                     |
                                     v
                         Better recommendations for
                         routing, compatibility, and
                         deployment decisions

* Upload occurs only when an API key is configured. Use --local to skip upload.

Running a Benchmark

octomil benchmark phi-4-mini

Output:

Model:       phi-4-mini (3.8B params)
Engine:      mlx
Platform:    darwin-arm64 (Apple M3 Pro)
RAM:         18 GB
Iterations:  10

Latency:
  avg: 55.2 ms
  p50: 54.8 ms
  p95: 58.1 ms
  p99: 61.3 ms

Throughput:
  avg:  38.7 tok/s
  peak: 42.1 tok/s

Token Timing:
  TTFT: 142.3 ms
  TPOT: 25.8 ms

Memory:
  peak: 4.2 GB
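The token-timing metrics relate to end-to-end latency in a standard way: total generation time is roughly TTFT plus (tokens − 1) × TPOT, and TPOT implies a steady-state decode throughput. A minimal sketch of these relationships (generic streaming-latency arithmetic, not Octomil-specific code):

```python
# Standard streaming-latency decomposition (not Octomil-specific):
#   total_ms ~= TTFT + (n_tokens - 1) * TPOT

def estimated_generation_ms(ttft_ms: float, tpot_ms: float, n_tokens: int) -> float:
    """Estimated end-to-end time to generate n_tokens of output."""
    return ttft_ms + (n_tokens - 1) * tpot_ms

def decode_throughput_tok_s(tpot_ms: float) -> float:
    """Steady-state decode throughput implied by time-per-output-token."""
    return 1000.0 / tpot_ms
```

With the numbers above, a TPOT of 25.8 ms implies roughly 38.8 tok/s of decode throughput, consistent with the reported 38.7 tok/s average.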

Sharing Results

octomil benchmark uploads anonymous metrics by default when an API key is present. If no API key is configured, nothing is uploaded. Use --local to force local-only mode.

octomil benchmark phi-4-mini            # default: share when an API key is configured
octomil benchmark phi-4-mini --local    # never upload

Shared fields: model name, backend/runtime, platform, CPU architecture, OS version, accelerator type, total RAM, iteration count, prompt/completion token counts, latency distribution (avg/min/max/p50/p90/p95/p99), TTFT, TPOT, throughput, and peak memory. Not shared: prompts, model outputs, model files/weights, IP addresses, device IDs, or user profile data.
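One way to picture the sharing boundary is as a field allowlist: anything not on the list stays on your machine. A hypothetical sketch of that idea (this is not the CLI's actual redaction code, and the `accelerator` field name is an assumption):

```python
# Hypothetical allowlist mirroring the "Shared fields" list above.
# Prompts, outputs, IP addresses, and device IDs never appear here.
SHARED_FIELDS = {
    "model", "backend", "platform", "arch", "os_version", "accelerator",
    "ram_total_bytes", "iterations", "prompt_tokens", "completion_tokens",
    "avg_latency_ms", "p50_latency_ms", "p95_latency_ms", "p99_latency_ms",
    "ttft_ms", "tpot_ms", "avg_tokens_per_second", "peak_tokens_per_second",
    "peak_memory_bytes",
}

def redact(results: dict) -> dict:
    """Keep only allow-listed fields; everything else is dropped before upload."""
    return {k: v for k, v in results.items() if k in SHARED_FIELDS}
```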

Benchmark Options

Option          Default   Description
--local         off       Keep results local (no upload)
--iterations    10        Number of benchmark iterations
--max-tokens    50        Max tokens per iteration
--engine        auto      Force a specific engine
--all-engines   off       Benchmark all available engines

Device Profiles API

Aggregated performance data is available via the API.

GET /api/v1/benchmarks/leaderboard

Retrieve the performance leaderboard -- best throughput per model, platform, and backend combination.

curl "https://api.octomil.com/api/v1/benchmarks/leaderboard?model=phi-4-mini"

[
  {
    "model": "phi-4-mini",
    "backend": "mlx",
    "platform": "darwin",
    "arch": "arm64",
    "avg_tokens_per_second": 41.3,
    "avg_latency_ms": 52.8,
    "avg_ttft_ms": 138.4,
    "avg_tpot_ms": 24.2,
    "avg_p99_latency_ms": 59.7,
    "submissions": 847
  },
  {
    "model": "phi-4-mini",
    "backend": "llamacpp",
    "platform": "linux",
    "arch": "x86_64",
    "avg_tokens_per_second": 28.6,
    "avg_latency_ms": 71.4,
    "avg_ttft_ms": 195.2,
    "avg_tpot_ms": 35.0,
    "avg_p99_latency_ms": 88.1,
    "submissions": 312
  }
]
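As a mental model, each leaderboard row can be thought of as a per-(model, platform, backend) mean over raw submissions, which is the averaging behavior described under Gotchas. A hypothetical sketch of that aggregation, not Octomil's actual server code:

```python
# Illustrative aggregation: mean metrics per (model, platform, backend)
# tuple, sorted so the best throughput comes first. Field names mirror
# the API payloads in this section.
from collections import defaultdict
from statistics import mean

def aggregate_leaderboard(submissions):
    groups = defaultdict(list)
    for s in submissions:
        groups[(s["model"], s["platform"], s["backend"])].append(s)
    rows = []
    for (model, platform, backend), subs in groups.items():
        rows.append({
            "model": model,
            "platform": platform,
            "backend": backend,
            "avg_tokens_per_second": mean(s["avg_tokens_per_second"] for s in subs),
            "avg_latency_ms": mean(s["avg_latency_ms"] for s in subs),
            "submissions": len(subs),
        })
    return sorted(rows, key=lambda r: r["avg_tokens_per_second"], reverse=True)
```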

GET /api/v1/benchmarks

List individual benchmark submissions. Filter by model, platform, or backend.

curl "https://api.octomil.com/api/v1/benchmarks?model=phi-4-mini&platform=darwin&limit=10"

POST /api/v1/benchmarks

Submit benchmark results programmatically. This is the same endpoint the CLI uses when sharing is enabled.

curl -X POST https://api.octomil.com/api/v1/benchmarks \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "model_size_bytes": 2147483648,
    "quantization": "q4_k_m",
    "backend": "mlx",
    "platform": "darwin",
    "arch": "arm64",
    "os_version": "15.3",
    "ram_total_bytes": 18253611008,
    "iterations": 10,
    "prompt_tokens": 32,
    "completion_tokens": 128,
    "avg_latency_ms": 55.2,
    "p50_latency_ms": 54.8,
    "p95_latency_ms": 58.1,
    "p99_latency_ms": 61.3,
    "ttft_ms": 142.3,
    "tpot_ms": 25.8,
    "avg_tokens_per_second": 38.7,
    "peak_tokens_per_second": 42.1,
    "peak_memory_bytes": 4509715456,
    "source": "manual"
  }'
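The same submission can be made from Python. A sketch using only the standard library, with the endpoint and field names taken from the curl example above (the token and metric values are placeholders):

```python
# Illustrative Python equivalent of the curl POST above.
import json
import urllib.request

API_URL = "https://api.octomil.com/api/v1/benchmarks"

def build_payload(model, backend, platform, arch, metrics):
    """Assemble a submission body from measured metrics."""
    return {"model": model, "backend": backend,
            "platform": platform, "arch": arch, **metrics}

def submit(payload, token):
    """POST a benchmark record; returns the decoded JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```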

How Profiles Are Used

Device profiling data feeds into multiple Octomil systems:

Deployment Orchestration

When you deploy a model to your device fleet, Octomil checks measured performance data for each device class. If real benchmark data exists for a device/model combination, deployment plans use data_source: "measured" instead of "estimated":

{
  "device_id": "device-abc",
  "format": "coreml",
  "executor": "ane",
  "quantization": "int4",
  "runtime_config": {
    "data_source": "measured",
    "expected_throughput_tok_s": 38.7,
    "expected_latency_ms": 55.2
  }
}

Measured data produces more accurate deployment decisions and reduces the risk of deploying models to devices that can't handle them well.
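A planner following this scheme can resolve `data_source` with a simple lookup-with-fallback. A hypothetical sketch; the device-class keys and numbers are invented for illustration:

```python
# Prefer fleet-measured numbers for a (device_class, model) pair;
# fall back to static estimates and label the source accordingly.
def runtime_config(measured, estimates, device_class, model):
    key = (device_class, model)
    if key in measured:
        tps, lat = measured[key]
        source = "measured"
    else:
        tps, lat = estimates[key]
        source = "estimated"
    return {"data_source": source,
            "expected_throughput_tok_s": tps,
            "expected_latency_ms": lat}
```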

Routing Decisions

The model routing engine uses profiling data to estimate on-device latency and determine whether a model should run locally or fall back to cloud inference.
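In spirit, that decision reduces to comparing the profiled on-device latency against a latency budget. A deliberately simplified sketch, not Octomil's actual routing policy:

```python
# Route locally when profiled device latency fits the budget;
# otherwise (or when no profile exists) fall back to cloud inference.
def route(profiled_latency_ms, budget_ms):
    if profiled_latency_ms is not None and profiled_latency_ms <= budget_ms:
        return "local"
    return "cloud"
```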

Move-to-Device Recommendations

The recommendation engine uses fleet-wide benchmark data to assess device compatibility when recommending models for on-device deployment.

Dashboard

The benchmark leaderboard is visible in the Octomil dashboard. The leaderboard shows:

  • Performance rankings by model, grouped by platform and backend
  • Throughput and latency distributions across submissions
  • Device coverage -- how many unique device types have contributed data for each model
  • Trends over time as new hardware and runtime versions are tested

Gotchas

  • First benchmark downloads the model — if the model isn't cached locally, the first run includes download time. Run octomil pull <model> first for accurate timing.
  • Background processes affect results — close other applications before benchmarking. Background CPU/GPU load skews latency and throughput numbers.
  • Thermal throttling on mobile — running multiple benchmark iterations on a phone can trigger thermal throttling, producing progressively slower results. Use --iterations 5 on mobile.
  • Shared benchmark submissions are anonymous but permanent — uploaded benchmark records cannot currently be retracted. Use --local for private runs or while testing non-representative configurations.
  • Leaderboard averages across submissions — a single outlier submission doesn't dominate. The leaderboard shows the mean across all submissions for a given model/platform/backend tuple.
  • Benchmark ≠ production performance — benchmarks use synthetic prompts with fixed token counts. Real-world performance varies with prompt length, concurrent requests, and memory pressure.