Quality Evaluation

Before deploying a model to edge devices, verify that on-device inference quality matches cloud quality. Octomil's eval harness runs the same inputs through both paths and reports statistical differences.

CLI: octomil eval

octomil eval phi-4-mini --test-data test.jsonl --threshold 0.95

Test Data Format

A JSONL file where each line is a JSON object with an input key and an optional expected_output:

{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Translate 'hello' to French", "expected_output": "bonjour"}
{"input": "Summarize: The quick brown fox jumps over the lazy dog."}

Options

| Flag | Default | Description |
|------|---------|-------------|
| --test-data, -d | (required) | Path to JSONL test file |
| --threshold, -t | 0.95 | Minimum quality score (0.0-1.0) |
| --api-base | http://localhost:8000 | Server URL |
| --metrics, -m | similarity,exact_match,latency | Comma-separated metrics |

Output

Quality Evaluation: phi-4-mini
================================
Inputs evaluated: 50
Overall quality score: 0.97

Metrics:
similarity: 0.98 (mean), 0.94 (p5)
exact_match: 0.82
latency_ratio: 1.2x (device vs cloud)

Result: PASS (0.97 >= 0.95 threshold)

The exit code is 0 on pass and 1 on fail, so the command can gate CI pipelines directly.
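The exit-code contract can be wrapped in a scripted gate. A minimal sketch using Python's subprocess module; the octomil invocation is the one documented above, while the wrapper function is our own:

```python
import subprocess
import sys

def quality_gate(cmd):
    """Run an eval command and return True iff it exited 0 (PASS)."""
    result = subprocess.run(cmd)
    return result.returncode == 0

if __name__ == "__main__":
    passed = quality_gate(
        ["octomil", "eval", "phi-4-mini",
         "--test-data", "test.jsonl", "--threshold", "0.95"]
    )
    sys.exit(0 if passed else 1)
```

Propagating the exit code keeps the wrapper compatible with any CI system that fails a job on a nonzero status.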

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/v1/eval | POST | Run a quality evaluation |
| /api/v1/eval/{eval_id} | GET | Get evaluation results |
| /api/v1/eval/history/{model_id} | GET | List past evaluations for a model |

Run an Evaluation

curl -X POST http://localhost:8000/api/v1/eval \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "inputs": ["What is 2+2?", "Translate hello to French"],
    "threshold": 0.95
  }'
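The same request can be built from the Python standard library. A sketch that constructs the documented payload and headers; the server URL and token are placeholders, and the builder function is ours, not Octomil's:

```python
import json
import urllib.request

def build_eval_request(model_id, inputs, threshold,
                       api_base="http://localhost:8000", token=None):
    """Build the POST /api/v1/eval request documented above."""
    payload = {"model_id": model_id, "inputs": inputs, "threshold": threshold}
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{api_base}/api/v1/eval",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )

# To send: urllib.request.urlopen(build_eval_request(...)) returns the
# JSON evaluation result.
```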

Response includes per-input comparison and aggregate statistics using Welch's t-test and Cohen's d effect size.
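For reference, Welch's t statistic and Cohen's d for two independent samples can be computed as follows. This is a plain re-derivation of the textbook formulas, not Octomil's implementation:

```python
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic: mean difference over the unpooled standard error.
    Does not assume equal variances between the two samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def cohens_d(a, b):
    """Cohen's d effect size: mean difference over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled
```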

Integration with octomil push

Auto-quantization can gate on quality:

octomil push phi-4-mini --quantize int8 --quality-threshold 0.95

This quantizes the model and runs a quality eval. If the score falls below 0.95, the push is aborted and the original model is preserved.
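The gate logic reduces to a simple check. A sketch with hypothetical quantize/evaluate callables standing in for Octomil's internals:

```python
def push_with_quality_gate(model, quantize, evaluate, threshold=0.95):
    """Quantize, evaluate, and keep the quantized model only if it passes.

    `quantize` and `evaluate` are caller-supplied callables used here for
    illustration; they are not Octomil APIs.
    """
    quantized = quantize(model)
    score = evaluate(quantized)
    if score < threshold:
        return model, False  # abort: the original model is preserved
    return quantized, True
```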

Metrics

| Metric | Description |
|--------|-------------|
| similarity | Text similarity between cloud and device outputs (difflib) |
| exact_match | Fraction of outputs that match exactly |
| latency | Inference time comparison (device vs cloud) |
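The table names difflib for the similarity metric. A sketch of how these metrics can be computed from paired cloud/device outputs; the aggregation here is our own and may differ from Octomil's exact formulas:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """difflib ratio in [0, 1] between a cloud output and a device output."""
    return SequenceMatcher(None, a, b).ratio()

def exact_match(cloud_outputs, device_outputs):
    """Fraction of output pairs that match exactly."""
    pairs = list(zip(cloud_outputs, device_outputs))
    return sum(c == d for c, d in pairs) / len(pairs)

def latency_ratio(device_ms, cloud_ms):
    """Mean device latency over mean cloud latency (1.2 means 1.2x slower)."""
    return (sum(device_ms) / len(device_ms)) / (sum(cloud_ms) / len(cloud_ms))
```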

Workflow

  1. Prepare test data: Create a JSONL file with representative inputs
  2. Run eval: octomil eval <model> -d test.jsonl
  3. Review results: Check quality score and per-metric breakdown
  4. Gate deployment: Use --quality-threshold on octomil push to automate
  5. Track history: Query /api/v1/eval/history/{model_id} to see trends over time