# Benchmarks
Measure model performance before deploying to your fleet. Benchmarks cover latency, throughput, memory footprint, and output quality.
## Run a benchmark

**CLI**

```shell
octomil eval phi-4-mini --benchmark
```

```text
Model: phi-4-mini (Q4_K_M, 2.2 GB)
Engine: mlx

Latency
  Time to first token:  142 ms
  Tokens/second:        38.4 tok/s
  End-to-end (128 tok): 3.47 s

Memory
  Peak RSS:        2.8 GB
  Model load time: 1.2 s

Quality (mmlu-mini, 100 samples)
  Accuracy:         0.72
  Baseline (cloud): 0.81
  Delta:            -0.09
```
**Python**

```python
from octomil import OctomilClient

client = OctomilClient(api_key="edg_...")

result = client.eval.benchmark(
    model_id="phi-4-mini",
    dataset="mmlu-mini",
    samples=100,
)

print(f"TTFT: {result.ttft_ms} ms")
print(f"Throughput: {result.tokens_per_second} tok/s")
print(f"Accuracy: {result.accuracy}")
```
## What gets measured
| Metric | Description |
|---|---|
| Time to first token (TTFT) | Latency from request to first generated token |
| Tokens per second | Sustained generation throughput |
| End-to-end latency | Total time for a fixed-length generation |
| Peak memory (RSS) | Maximum memory usage during inference |
| Model load time | Time to load model weights from disk |
| Accuracy | Correctness on an evaluation dataset |
| Quality delta | Difference vs. a cloud baseline |
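The three latency metrics are all derived from per-token timestamps. A minimal sketch of that arithmetic (plain Python, not part of the octomil SDK; the function name and inputs are illustrative):

```python
def latency_metrics(request_ts: float, token_ts: list[float]) -> dict:
    """Derive TTFT, sustained throughput, and end-to-end latency
    from the request timestamp and each token's arrival time (seconds)."""
    ttft_ms = (token_ts[0] - request_ts) * 1000
    end_to_end_s = token_ts[-1] - request_ts
    # Sustained throughput: tokens emitted after the first,
    # divided by the span of the generation itself.
    tokens_per_second = (len(token_ts) - 1) / (token_ts[-1] - token_ts[0])
    return {
        "ttft_ms": ttft_ms,
        "tokens_per_second": tokens_per_second,
        "end_to_end_s": end_to_end_s,
    }

# Example: four tokens, first after 100 ms, then one every 25 ms.
m = latency_metrics(0.0, [0.100, 0.125, 0.150, 0.175])
```

Note that TTFT is deliberately excluded from the throughput figure, so a slow prompt prefill does not drag down the reported tokens/second.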
## Evaluation datasets
Use built-in datasets or bring your own:
```shell
# Built-in dataset
octomil eval phi-4-mini --dataset mmlu-mini

# Custom dataset (JSONL with prompt/expected pairs)
octomil eval phi-4-mini --dataset ./my-eval.jsonl
```
Custom dataset format:
```jsonl
{"prompt": "What is the capital of France?", "expected": "Paris"}
{"prompt": "2 + 2 = ?", "expected": "4"}
```
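Before pointing `octomil eval` at a custom file, it can help to confirm that every line is valid JSON and carries both required keys. A small validator sketch (plain Python using only the standard library; the function is ours, not part of octomil):

```python
import json

REQUIRED_KEYS = {"prompt", "expected"}

def validate_eval_file(path: str) -> int:
    """Return the number of valid records; raise on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on invalid JSON
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            count += 1
    return count
```

Running this before an eval turns a cryptic mid-run failure into an immediate error with a line number.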
## Compare models
Benchmark multiple models side-by-side:
```shell
octomil eval phi-4-mini gemma3-1b qwen-1.5b --benchmark --dataset mmlu-mini
```
| Metric | phi-4-mini | gemma3-1b | qwen-1.5b |
|---|---|---|---|
| TTFT (ms) | 142 | 89 | 104 |
| Tokens/s | 38.4 | 52.1 | 47.8 |
| Memory (GB) | 2.8 | 1.4 | 1.6 |
| Accuracy | 0.72 | 0.58 | 0.65 |
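A side-by-side comparison usually ends in a selection rule: the most accurate model that still fits the target device. A sketch of that decision (plain Python; the numbers are the sample results above, and the helper is illustrative, not an octomil API):

```python
# Benchmark results keyed by model, mirroring the comparison table above.
results = {
    "phi-4-mini": {"ttft_ms": 142, "tok_s": 38.4, "mem_gb": 2.8, "accuracy": 0.72},
    "gemma3-1b":  {"ttft_ms": 89,  "tok_s": 52.1, "mem_gb": 1.4, "accuracy": 0.58},
    "qwen-1.5b":  {"ttft_ms": 104, "tok_s": 47.8, "mem_gb": 1.6, "accuracy": 0.65},
}

def best_model(results: dict, max_mem_gb: float) -> str:
    """Most accurate model whose peak memory fits the device budget."""
    candidates = {m: r for m, r in results.items() if r["mem_gb"] <= max_mem_gb}
    if not candidates:
        raise ValueError(f"no model fits within {max_mem_gb} GB")
    return max(candidates, key=lambda m: candidates[m]["accuracy"])
```

With a 2 GB budget this picks `qwen-1.5b`: `phi-4-mini` is more accurate but does not fit, and `qwen-1.5b` beats `gemma3-1b` on accuracy among the models that do.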
## Device-specific benchmarks
Profile performance on a specific device class:
```shell
octomil eval phi-4-mini --benchmark --device-profile iphone-15-pro
```
See Device Profiling for available device profiles and how to register custom hardware specs.
## CI integration
Run benchmarks in CI to catch performance regressions:
```shell
octomil eval phi-4-mini --benchmark --assert-ttft-ms 200 --assert-tok-s 30
```
The command exits with a non-zero status if any assertion fails, failing the CI job.
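If you would rather enforce thresholds from a script, the same gate can be expressed in a few lines. A sketch (the helper is ours, not part of octomil; only the `ttft_ms` and `tokens_per_second` fields mirror the SDK example earlier on this page):

```python
import sys

def check_thresholds(result: dict, max_ttft_ms: float, min_tok_s: float) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    if result["ttft_ms"] > max_ttft_ms:
        failures.append(f"TTFT {result['ttft_ms']} ms exceeds {max_ttft_ms} ms")
    if result["tokens_per_second"] < min_tok_s:
        failures.append(
            f"throughput {result['tokens_per_second']} tok/s below {min_tok_s} tok/s"
        )
    return failures

# In CI, `result` would come from client.eval.benchmark(...); hardcoded here.
result = {"ttft_ms": 142, "tokens_per_second": 38.4}
failures = check_thresholds(result, max_ttft_ms=200, min_tok_s=30)
if failures:
    print("\n".join(failures), file=sys.stderr)
    sys.exit(1)  # non-zero exit fails the CI job
```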
## Related
- Quality Evaluation -- detailed eval configuration
- Device Profiling -- hardware capability detection
- Compatibility matrix -- model-device compatibility