Device Profiling
Octomil maintains a database of real-world inference performance data across device types, collected anonymously from octomil benchmark runs when sharing is enabled (default behavior with an API key). This data powers deployment recommendations, routing decisions, and compatibility checks -- so when you ask "will this model run well on an iPhone 14?", the answer comes from measured data, not estimates.
The Data Flywheel
Every shared octomil benchmark run contributes anonymous performance data. More submissions produce better recommendations for everyone. No personally identifiable information is collected -- submissions include runtime/hardware characteristics and performance numbers only.
You run a benchmark
        |
        v
octomil benchmark ------> Anonymous data submitted*
        |                           |
        v                           v
Your local results        Aggregated into device profiles
                                    |
                                    v
                      Better recommendations for routing,
                      compatibility, and deployment decisions

* Upload occurs only when an API key is configured. Use --local to skip upload.
Running a Benchmark
octomil benchmark phi-4-mini
Output:
Model:      phi-4-mini (3.8B params)
Engine:     mlx
Platform:   darwin-arm64 (Apple M3 Pro)
RAM:        18 GB
Iterations: 10

Latency:
  avg: 55.2 ms
  p50: 54.8 ms
  p95: 58.1 ms
  p99: 61.3 ms

Throughput:
  avg:  38.7 tok/s
  peak: 42.1 tok/s

Token Timing:
  TTFT: 142.3 ms
  TPOT: 25.8 ms

Memory:
  peak: 4.2 GB
Sharing Results
octomil benchmark uploads anonymous metrics by default when an API key is present.
If no API key is configured, nothing is uploaded. Use --local to force local-only mode.
octomil benchmark phi-4-mini # default: share if logged in
octomil benchmark phi-4-mini --local
Shared fields: model name, backend/runtime, platform, CPU architecture, OS version, accelerator type, total RAM, iteration count, prompt/completion token counts, latency distribution (avg/min/max/p50/p90/p95/p99), TTFT, TPOT, throughput, and peak memory. Not shared: prompts, model outputs, model files/weights, IP addresses, device IDs, or user profile data.
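The allow-list above can be expressed as a quick sanity check. This is a hypothetical illustration of the sharing policy, not code from the octomil client; `build_submission` and the field names are stand-ins based on the lists in this section.

```python
# Hypothetical sketch of the shared-fields policy described above:
# only runtime/hardware characteristics and performance numbers are kept;
# prompts, outputs, and identifying data never enter the payload.

SHARED_FIELDS = {
    "model", "backend", "platform", "arch", "os_version", "ram_total_bytes",
    "iterations", "prompt_tokens", "completion_tokens",
    "avg_latency_ms", "p50_latency_ms", "p95_latency_ms", "p99_latency_ms",
    "ttft_ms", "tpot_ms", "avg_tokens_per_second", "peak_tokens_per_second",
    "peak_memory_bytes",
}

def build_submission(raw_results: dict) -> dict:
    """Keep only allow-listed fields from a local benchmark result."""
    return {k: v for k, v in raw_results.items() if k in SHARED_FIELDS}

local = {
    "model": "phi-4-mini",
    "prompt": "Explain quantization",  # never uploaded
    "device_id": "A1B2C3",             # never uploaded
    "avg_latency_ms": 55.2,
}
print(build_submission(local))
# {'model': 'phi-4-mini', 'avg_latency_ms': 55.2}
```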
Benchmark Options
| Option | Default | Description |
|---|---|---|
| `--local` | off | Keep results local (no upload) |
| `--iterations` | 10 | Number of benchmark iterations |
| `--max-tokens` | 50 | Max tokens per iteration |
| `--engine` | auto | Force a specific engine |
| `--all-engines` | off | Benchmark all available engines |
Device Profiles API
Aggregated performance data is available via the API.
GET /api/v1/benchmarks/leaderboard
Retrieve the performance leaderboard -- best throughput per model, platform, and backend combination.
- cURL
- Python
- JavaScript
curl "https://api.octomil.com/api/v1/benchmarks/leaderboard?model=phi-4-mini"
import requests
response = requests.get(
"https://api.octomil.com/api/v1/benchmarks/leaderboard",
params={"model": "phi-4-mini"},
)
print(response.json())
const response = await fetch(
"https://api.octomil.com/api/v1/benchmarks/leaderboard?model=phi-4-mini"
);
const data = await response.json();
console.log(data);
[
{
"model": "phi-4-mini",
"backend": "mlx",
"platform": "darwin",
"arch": "arm64",
"avg_tokens_per_second": 41.3,
"avg_latency_ms": 52.8,
"avg_ttft_ms": 138.4,
"avg_tpot_ms": 24.2,
"avg_p99_latency_ms": 59.7,
"submissions": 847
},
{
"model": "phi-4-mini",
"backend": "llamacpp",
"platform": "linux",
"arch": "x86_64",
"avg_tokens_per_second": 28.6,
"avg_latency_ms": 71.4,
"avg_ttft_ms": 195.2,
"avg_tpot_ms": 35.0,
"avg_p99_latency_ms": 88.1,
"submissions": 312
}
]
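Once you have the leaderboard JSON, choosing the best platform/backend combination for a model reduces to a `max()` over throughput. The sketch below reuses (a trimmed version of) the example response above.

```python
# Pick the highest-throughput leaderboard entry for a model.
# Entries mirror the example leaderboard response shown above.
leaderboard = [
    {"backend": "mlx", "platform": "darwin", "avg_tokens_per_second": 41.3},
    {"backend": "llamacpp", "platform": "linux", "avg_tokens_per_second": 28.6},
]

best = max(leaderboard, key=lambda e: e["avg_tokens_per_second"])
print(f"{best['backend']} on {best['platform']}: "
      f"{best['avg_tokens_per_second']} tok/s")
# mlx on darwin: 41.3 tok/s
```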
GET /api/v1/benchmarks
List individual benchmark submissions. Filter by model, platform, or backend.
- cURL
- Python
- JavaScript
curl "https://api.octomil.com/api/v1/benchmarks?model=phi-4-mini&platform=darwin&limit=10"
import requests
response = requests.get(
"https://api.octomil.com/api/v1/benchmarks",
params={"model": "phi-4-mini", "platform": "darwin", "limit": 10},
)
print(response.json())
const params = new URLSearchParams({
model: "phi-4-mini",
platform: "darwin",
limit: "10",
});
const response = await fetch(
`https://api.octomil.com/api/v1/benchmarks?${params}`
);
const data = await response.json();
console.log(data);
POST /api/v1/benchmarks
Submit benchmark results programmatically (used by CLI sharing when benchmarks are uploaded).
- cURL
- Python
- JavaScript
curl -X POST https://api.octomil.com/api/v1/benchmarks \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"model_size_bytes": 2147483648,
"quantization": "q4_k_m",
"backend": "mlx",
"platform": "darwin",
"arch": "arm64",
"os_version": "15.3",
"ram_total_bytes": 18253611008,
"iterations": 10,
"prompt_tokens": 32,
"completion_tokens": 128,
"avg_latency_ms": 55.2,
"p50_latency_ms": 54.8,
"p95_latency_ms": 58.1,
"p99_latency_ms": 61.3,
"ttft_ms": 142.3,
"tpot_ms": 25.8,
"avg_tokens_per_second": 38.7,
"peak_tokens_per_second": 42.1,
"peak_memory_bytes": 4509715456,
"source": "manual"
}'
import requests
response = requests.post(
"https://api.octomil.com/api/v1/benchmarks",
headers={"Authorization": "Bearer <token>"},
json={
"model": "phi-4-mini",
"model_size_bytes": 2147483648,
"quantization": "q4_k_m",
"backend": "mlx",
"platform": "darwin",
"arch": "arm64",
"os_version": "15.3",
"ram_total_bytes": 18253611008,
"iterations": 10,
"prompt_tokens": 32,
"completion_tokens": 128,
"avg_latency_ms": 55.2,
"p50_latency_ms": 54.8,
"p95_latency_ms": 58.1,
"p99_latency_ms": 61.3,
"ttft_ms": 142.3,
"tpot_ms": 25.8,
"avg_tokens_per_second": 38.7,
"peak_tokens_per_second": 42.1,
"peak_memory_bytes": 4509715456,
"source": "manual",
},
)
print(response.json())
const response = await fetch("https://api.octomil.com/api/v1/benchmarks", {
method: "POST",
headers: {
"Authorization": "Bearer <token>",
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "phi-4-mini",
model_size_bytes: 2147483648,
quantization: "q4_k_m",
backend: "mlx",
platform: "darwin",
arch: "arm64",
os_version: "15.3",
ram_total_bytes: 18253611008,
iterations: 10,
prompt_tokens: 32,
completion_tokens: 128,
avg_latency_ms: 55.2,
p50_latency_ms: 54.8,
p95_latency_ms: 58.1,
p99_latency_ms: 61.3,
ttft_ms: 142.3,
tpot_ms: 25.8,
avg_tokens_per_second: 38.7,
peak_tokens_per_second: 42.1,
peak_memory_bytes: 4509715456,
source: "manual",
}),
});
const data = await response.json();
console.log(data);
How Profiles Are Used
Device profiling data feeds into multiple Octomil systems:
Deployment Orchestration
When you deploy a model to your device fleet, Octomil checks measured performance data for each device class. If real benchmark data exists for a device/model combination, deployment plans use data_source: "measured" instead of "estimated":
{
"device_id": "device-abc",
"format": "coreml",
"executor": "ane",
"quantization": "int4",
"runtime_config": {
"data_source": "measured",
"expected_throughput_tok_s": 38.7,
"expected_latency_ms": 55.2
}
}
Measured data produces more accurate deployment decisions and reduces the risk of deploying models to devices that can't handle them well.
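The measured-vs-estimated switch can be sketched as a simple lookup. This is an illustrative sketch, not Octomil's internal implementation; the `profiles` mapping and the bare `"estimated"` fallback are assumptions.

```python
def plan_runtime_config(device_class: str, model: str, profiles: dict) -> dict:
    """Build a runtime_config, preferring measured fleet data when available.

    `profiles` maps (device_class, model) -> aggregated benchmark numbers.
    This helper is a hypothetical stand-in for Octomil's internal aggregation.
    """
    measured = profiles.get((device_class, model))
    if measured is not None:
        return {
            "data_source": "measured",
            "expected_throughput_tok_s": measured["avg_tokens_per_second"],
            "expected_latency_ms": measured["avg_latency_ms"],
        }
    # No benchmark data for this combination: fall back to an estimate.
    return {"data_source": "estimated"}

profiles = {
    ("m3-pro", "phi-4-mini"): {"avg_tokens_per_second": 38.7,
                               "avg_latency_ms": 55.2},
}
print(plan_runtime_config("m3-pro", "phi-4-mini", profiles)["data_source"])
# measured
print(plan_runtime_config("pixel-8", "phi-4-mini", profiles)["data_source"])
# estimated
```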
Routing Decisions
The model routing engine uses profiling data to estimate on-device latency and determine whether a model should run locally or fall back to cloud inference.
Move-to-Device Recommendations
The recommendation engine uses fleet-wide benchmark data to assess device compatibility when recommending models for on-device deployment.
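At its core, that compatibility check compares fleet-measured peak memory against a device's RAM. The sketch below is illustrative only; the 0.7 headroom factor is an assumption, not Octomil's actual threshold.

```python
def is_compatible(device_ram_bytes: int, peak_memory_bytes: int,
                  headroom: float = 0.7) -> bool:
    """Assume a model fits if measured peak memory stays under a fraction
    of total RAM. The 0.7 headroom factor is an illustrative assumption."""
    return peak_memory_bytes <= device_ram_bytes * headroom

# 4.2 GB measured peak (from the benchmark output above)
# on an 18 GB device vs. a 4 GB device:
print(is_compatible(18 * 1024**3, 4509715456))  # True
print(is_compatible(4 * 1024**3, 4509715456))   # False
```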
Dashboard
The benchmark leaderboard is visible in the Octomil dashboard. The leaderboard shows:
- Performance rankings by model, grouped by platform and backend
- Throughput and latency distributions across submissions
- Device coverage -- how many unique device types have contributed data for each model
- Trends over time as new hardware and runtime versions are tested
Gotchas
- First benchmark downloads the model -- if the model isn't cached locally, the first run includes download time. Run `octomil pull <model>` first for accurate timing.
- Background processes affect results -- close other applications before benchmarking. Background CPU/GPU load skews latency and throughput numbers.
- Thermal throttling on mobile -- running multiple benchmark iterations on a phone can trigger thermal throttling, producing progressively slower results. Use `--iterations 5` on mobile.
- Shared benchmark submissions are anonymous but permanent -- uploaded benchmark records cannot currently be retracted. Use `--local` for private runs or while testing non-representative configurations.
- Leaderboard averages across submissions -- a single outlier submission doesn't dominate. The leaderboard shows the mean across all submissions for a given model/platform/backend tuple.
- Benchmark ≠ production performance -- benchmarks use synthetic prompts with fixed token counts. Real-world performance varies with prompt length, concurrent requests, and memory pressure.
Related
- Local Inference — run models locally
- Model Routing — routing decisions powered by profiling data
- Device Targeting — deployment recommendations using fleet data
- Speculative Decoding — benchmark speculative vs standard performance
- Observability — inference metrics reported to dashboard