# Quality Evaluation

Before deploying a model to edge devices, verify that on-device inference quality matches cloud quality. Octomil's eval harness runs the same inputs through both paths and reports statistical differences.
## CLI: `octomil eval`

```shell
octomil eval phi-4-mini --test-data test.jsonl --threshold 0.95
```
## Test Data Format

A JSONL file where each line has an `input` key and an optional `expected_output`:

```jsonl
{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Translate 'hello' to French", "expected_output": "bonjour"}
{"input": "Summarize: The quick brown fox jumps over the lazy dog."}
```
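Before running an eval, it can help to sanity-check the test file. The helper below is a hypothetical sketch (not part of the Octomil CLI) that parses each line and enforces the required `input` key:

```python
import json

def validate_test_data(path: str) -> int:
    """Validate a JSONL eval file; return the number of test cases."""
    count = 0
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)  # raises ValueError on malformed JSON
            if "input" not in record:
                raise ValueError(f"line {lineno}: missing required 'input' key")
            count += 1
    return count
```

`expected_output` is intentionally not checked, since it is optional.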
## Options

| Flag | Default | Description |
|---|---|---|
| `--test-data`, `-d` | (required) | Path to JSONL test file |
| `--threshold`, `-t` | `0.95` | Minimum quality score (0.0-1.0) |
| `--api-base` | `http://localhost:8000` | Server URL |
| `--metrics`, `-m` | `similarity,exact_match,latency` | Comma-separated metrics |
## Output

```text
Quality Evaluation: phi-4-mini
================================
Inputs evaluated: 50
Overall quality score: 0.97

Metrics:
  similarity:    0.98 (mean), 0.94 (p5)
  exact_match:   0.82
  latency_ratio: 1.2x (device vs cloud)

Result: PASS (0.97 >= 0.95 threshold)
```
The exit code is 0 on pass and 1 on fail, so the command can gate CI pipelines directly.
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/eval` | POST | Run a quality evaluation |
| `/api/v1/eval/{eval_id}` | GET | Get evaluation results |
| `/api/v1/eval/history/{model_id}` | GET | List past evaluations for a model |
### Run an Evaluation

```shell
curl -X POST http://localhost:8000/api/v1/eval \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "inputs": ["What is 2+2?", "Translate hello to French"],
    "threshold": 0.95
  }'
```
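The same request can be assembled from Python with the standard library. This is a minimal sketch: the payload fields mirror the curl example above, `build_eval_request` is a hypothetical helper name, and the response schema is not assumed here:

```python
import json

API_BASE = "http://localhost:8000"  # matches the --api-base default

def build_eval_request(model_id, inputs, threshold=0.95, token="TOKEN"):
    """Assemble the POST /api/v1/eval request as (url, headers, JSON body)."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model_id": model_id,
        "inputs": inputs,
        "threshold": threshold,
    })
    return f"{API_BASE}/api/v1/eval", headers, body

# To send it against a running server:
#   import urllib.request
#   url, headers, body = build_eval_request("phi-4-mini", ["What is 2+2?"])
#   req = urllib.request.Request(url, data=body.encode(), headers=headers, method="POST")
#   result = json.load(urllib.request.urlopen(req))
```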
The response includes a per-input comparison and aggregate statistics computed with Welch's t-test and Cohen's d effect size.
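For reference, both aggregate statistics can be computed over paired cloud/device score samples with the standard library. This is an illustrative sketch of the textbook formulas, not Octomil's implementation:

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic: mean difference scaled by unpooled variances."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def cohens_d(a, b):
    """Cohen's d effect size: mean difference over the pooled std deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled
```

Identical samples give 0 for both; larger magnitudes mean the device outputs drift further from cloud quality.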
## Integration with `octomil push`

Auto-quantization can gate on quality:

```shell
octomil push phi-4-mini --quantize int8 --quality-threshold 0.95
```

This quantizes the model and runs a quality eval. If the score falls below 0.95, the push is aborted and the original model is preserved.
## Metrics

| Metric | Description |
|---|---|
| `similarity` | Text similarity between cloud and device outputs (difflib) |
| `exact_match` | Fraction of outputs that match exactly |
| `latency` | Inference time comparison (device vs cloud) |
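Since the `similarity` metric is difflib-based, the core comparison looks like the following sketch (the exact scoring code may differ):

```python
import difflib

def similarity(cloud_output: str, device_output: str) -> float:
    """Ratio of matching characters between the two outputs, in [0.0, 1.0]."""
    return difflib.SequenceMatcher(None, cloud_output, device_output).ratio()
```

Identical outputs score 1.0; completely disjoint outputs score 0.0, so small wording drift between cloud and device still earns partial credit, unlike `exact_match`.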
## Workflow

1. **Prepare test data**: create a JSONL file with representative inputs.
2. **Run eval**: `octomil eval <model> -d test.jsonl`
3. **Review results**: check the quality score and per-metric breakdown.
4. **Gate deployment**: use `--quality-threshold` on `octomil push` to automate the gate.
5. **Track history**: query `/api/v1/eval/history/{model_id}` to see trends over time.