Quality Evaluation

Before deploying a model to edge devices, verify that on-device inference quality matches cloud quality. Octomil's eval harness runs the same inputs through both paths and reports statistical differences.

CLI: octomil eval

octomil eval phi-4-mini --test-data test.jsonl --threshold 0.95

Test Data Format

A JSONL file where each line is a JSON object with an input key and an optional expected_output:

{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Translate 'hello' to French", "expected_output": "bonjour"}
{"input": "Summarize: The quick brown fox jumps over the lazy dog."}

Options

| Flag | Default | Description |
|------|---------|-------------|
| --test-data, -d | (required) | Path to JSONL test file |
| --threshold, -t | 0.95 | Minimum quality score (0.0-1.0) |
| --api-base | http://localhost:8000 | Server URL |
| --metrics, -m | similarity,exact_match,latency | Comma-separated metrics |

Output

Quality Evaluation: phi-4-mini
================================
Inputs evaluated: 50
Overall quality score: 0.97

Metrics:
similarity: 0.98 (mean), 0.94 (p5)
exact_match: 0.82
latency_ratio: 1.2x (device vs cloud)

Result: PASS (0.97 >= 0.95 threshold)

The exit code is 0 on pass and 1 on fail, so the command can gate CI pipelines directly.
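The exit-code contract can be wrapped in a scripted gate. A minimal sketch using Python's subprocess module; the octomil invocation is the one documented above, while the wrapper function is our own:

```python
import subprocess
import sys

def quality_gate(cmd):
    """Run an eval command and return True iff it exited 0 (PASS)."""
    result = subprocess.run(cmd)
    return result.returncode == 0

if __name__ == "__main__":
    passed = quality_gate(
        ["octomil", "eval", "phi-4-mini",
         "--test-data", "test.jsonl", "--threshold", "0.95"]
    )
    sys.exit(0 if passed else 1)
```

Propagating the exit code keeps the wrapper compatible with any CI system that fails a job on a nonzero status.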

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/v1/eval | POST | Run a quality evaluation |
| /api/v1/eval/{eval_id} | GET | Get evaluation results |
| /api/v1/eval/history/{model_id} | GET | List past evaluations for a model |

Run an Evaluation

curl -X POST http://localhost:8000/api/v1/eval \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "phi-4-mini",
    "inputs": ["What is 2+2?", "Translate hello to French"],
    "threshold": 0.95
  }'
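The same request can be built from the Python standard library. A sketch that constructs the documented payload and headers; the server URL and token are placeholders, and the builder function is ours, not Octomil's:

```python
import json
import urllib.request

def build_eval_request(model_id, inputs, threshold,
                       api_base="http://localhost:8000", token=None):
    """Build the POST /api/v1/eval request documented above."""
    payload = {"model_id": model_id, "inputs": inputs, "threshold": threshold}
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{api_base}/api/v1/eval",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )

# To send: urllib.request.urlopen(build_eval_request(...)) returns the
# JSON evaluation result.
```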

Response includes per-input comparison and aggregate statistics using Welch's t-test and Cohen's d effect size.
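For reference, Welch's t statistic and Cohen's d for two independent samples can be computed as follows. This is a plain re-derivation of the textbook formulas, not Octomil's implementation:

```python
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic: mean difference over the unpooled standard error.
    Does not assume equal variances between the two samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def cohens_d(a, b):
    """Cohen's d effect size: mean difference over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled
```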

Integration with octomil push

Auto-quantization can gate on quality:

octomil push phi-4-mini --quantize int8 --quality-threshold 0.95

This quantizes the model and runs a quality eval. If the score falls below 0.95, the push is aborted and the original model is preserved.
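The gate logic reduces to a simple check. A sketch with hypothetical quantize/evaluate callables standing in for Octomil's internals:

```python
def push_with_quality_gate(model, quantize, evaluate, threshold=0.95):
    """Quantize, evaluate, and keep the quantized model only if it passes.

    `quantize` and `evaluate` are caller-supplied callables used here for
    illustration; they are not Octomil APIs.
    """
    quantized = quantize(model)
    score = evaluate(quantized)
    if score < threshold:
        return model, False  # abort: the original model is preserved
    return quantized, True
```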

Metrics

| Metric | Description |
|--------|-------------|
| similarity | Text similarity between cloud and device outputs (difflib) |
| exact_match | Fraction of outputs that match exactly |
| latency | Inference time comparison (device vs cloud) |
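The table names difflib for the similarity metric. A sketch of how these metrics can be computed from paired cloud/device outputs; the aggregation here is our own and may differ from Octomil's exact formulas:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """difflib ratio in [0, 1] between a cloud output and a device output."""
    return SequenceMatcher(None, a, b).ratio()

def exact_match(cloud_outputs, device_outputs):
    """Fraction of output pairs that match exactly."""
    pairs = list(zip(cloud_outputs, device_outputs))
    return sum(c == d for c, d in pairs) / len(pairs)

def latency_ratio(device_ms, cloud_ms):
    """Mean device latency over mean cloud latency (1.2 means 1.2x slower)."""
    return (sum(device_ms) / len(device_ms)) / (sum(cloud_ms) / len(cloud_ms))
```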

Workflow

  1. Prepare test data: Create a JSONL file with representative inputs
  2. Run eval: octomil eval <model> -d test.jsonl
  3. Review results: Check quality score and per-metric breakdown
  4. Gate deployment: Use --quality-threshold on octomil push to automate
  5. Track history: Query /api/v1/eval/history/{model_id} to see trends over time