Benchmarks

Measure model performance before deploying to your fleet. Benchmarks cover latency, throughput, memory footprint, and output quality.

Run a benchmark

octomil eval phi-4-mini --benchmark
Model: phi-4-mini (Q4_K_M, 2.2 GB)
Engine: mlx

Latency
Time to first token: 142 ms
Tokens/second: 38.4 tok/s
End-to-end (128 tok): 3.47 s

Memory
Peak RSS: 2.8 GB
Model load time: 1.2 s

Quality (mmlu-mini, 100 samples)
Accuracy: 0.72
Baseline (cloud): 0.81
Delta: -0.09

What gets measured

Metric                        Description
Time to first token (TTFT)    Latency from request to first generated token
Tokens per second             Sustained generation throughput
End-to-end latency            Total time for a fixed-length generation
Peak memory (RSS)             Maximum memory usage during inference
Model load time               Time to load model weights from disk
Accuracy                      Correctness on an evaluation dataset
Quality delta                 Accuracy difference vs. a cloud baseline
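The latency metrics above can be derived from timestamps around a streaming generation. A minimal sketch in Python — `generate` here is a hypothetical stand-in for any iterator that yields tokens, not an octomil API:

```python
import time

def measure_generation(generate):
    """Time a token stream: returns (ttft_s, tokens_per_s, total_s).

    `generate` is any iterable yielding tokens (hypothetical stand-in
    for a real engine's streaming interface).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in generate:
        now = time.perf_counter()
        if first is None:
            first = now  # time of the first token -> TTFT
        count += 1
    end = time.perf_counter()
    ttft = first - start
    total = end - start
    # Throughput is conventionally measured over the decode phase,
    # i.e. tokens generated after the first one.
    tok_s = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return ttft, tok_s, total
```

End-to-end latency is simply `total_s` for a fixed-length generation.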

Evaluation datasets

Use built-in datasets or bring your own:

# Built-in dataset
octomil eval phi-4-mini --dataset mmlu-mini

# Custom dataset (JSONL with prompt/expected pairs)
octomil eval phi-4-mini --dataset ./my-eval.jsonl

Custom dataset format:

{"prompt": "What is the capital of France?", "expected": "Paris"}
{"prompt": "2 + 2 = ?", "expected": "4"}
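A dataset in this format is one JSON object per line. A short Python sketch that writes such a file (the example cases are the ones shown above; the filename matches the earlier command):

```python
import json

# Example prompt/expected pairs in the schema octomil eval accepts.
cases = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

# JSONL: one json.dumps() result per line, newline-terminated.
with open("my-eval.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```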

Compare models

Benchmark multiple models side-by-side:

octomil eval phi-4-mini gemma3-1b qwen-1.5b --benchmark --dataset mmlu-mini
              phi-4-mini    gemma3-1b    qwen-1.5b
TTFT (ms)        142            89          104
Tokens/s         38.4           52.1         47.8
Memory (GB)      2.8            1.4          1.6
Accuracy         0.72           0.58         0.65
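One common way to act on a comparison like this is to pick the most accurate model that fits a device's memory budget. A sketch using the numbers from the sample run above (the selection logic itself is illustrative, not an octomil feature):

```python
# Metrics from the sample comparison run on this page.
results = {
    "phi-4-mini": {"ttft_ms": 142, "tok_s": 38.4, "mem_gb": 2.8, "accuracy": 0.72},
    "gemma3-1b":  {"ttft_ms": 89,  "tok_s": 52.1, "mem_gb": 1.4, "accuracy": 0.58},
    "qwen-1.5b":  {"ttft_ms": 104, "tok_s": 47.8, "mem_gb": 1.6, "accuracy": 0.65},
}

def best_model(results, max_mem_gb):
    """Most accurate model whose peak memory fits the budget, or None."""
    fits = {m: r for m, r in results.items() if r["mem_gb"] <= max_mem_gb}
    if not fits:
        return None
    return max(fits, key=lambda m: fits[m]["accuracy"])

print(best_model(results, max_mem_gb=2.0))  # qwen-1.5b
```

With a 2 GB budget, phi-4-mini is excluded and qwen-1.5b wins on accuracy despite gemma3-1b's better latency.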

Device-specific benchmarks

Profile performance on a specific device class:

octomil eval phi-4-mini --benchmark --device-profile iphone-15-pro

See Device Profiling for available device profiles and how to register custom hardware specs.

CI integration

Run benchmarks in CI to catch performance regressions:

octomil eval phi-4-mini --benchmark --assert-ttft-ms 200 --assert-tok-s 30

Exits with non-zero status if any assertion fails.
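If you post-process benchmark results yourself instead of using the built-in flags, the same gate logic is easy to replicate. A sketch mirroring the `--assert-ttft-ms` / `--assert-tok-s` checks — the metric dict and function names here are my own, not part of octomil:

```python
import sys

def check_benchmark(metrics, max_ttft_ms=None, min_tok_s=None):
    """Return a list of failed assertions (empty list means pass)."""
    failures = []
    if max_ttft_ms is not None and metrics["ttft_ms"] > max_ttft_ms:
        failures.append(f"TTFT {metrics['ttft_ms']} ms > limit {max_ttft_ms} ms")
    if min_tok_s is not None and metrics["tok_s"] < min_tok_s:
        failures.append(f"throughput {metrics['tok_s']} tok/s < floor {min_tok_s} tok/s")
    return failures

# Values from the sample run on this page, gated with the same
# thresholds as the CLI example (200 ms TTFT, 30 tok/s).
failures = check_benchmark({"ttft_ms": 142, "tok_s": 38.4},
                           max_ttft_ms=200, min_tok_s=30)
if failures:
    print("\n".join(failures))
    sys.exit(1)  # non-zero exit fails the CI job
```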