# Experiments
Compare model versions on real traffic before committing to a full rollout.
## When to use
- You trained a new model version and want to validate it against production
- You're testing different quantizations or model sizes on the same fleet
- You need statistical confidence before a full deployment
## Quick start
```shell
octomil experiment create my-experiment \
  --model radiology-v1 \
  --control 1.0.0 \
  --treatment 2.0.0 \
  --traffic-split 50
```

```
Experiment created: exp_a1b2c3d4
Model:     radiology-v1
Control:   1.0.0 (50%)
Treatment: 2.0.0 (50%)
Status:    running
```
```python
from octomil import OctomilClient

client = OctomilClient(api_key="edg_...")

experiment = client.experiments.create(
    name="my-experiment",
    model_id="radiology-v1",
    control_version="1.0.0",
    treatment_version="2.0.0",
    traffic_split=50,
)
print(experiment.id)  # exp_a1b2c3d4
```
```shell
curl -X POST https://api.octomil.com/v1/experiments \
  -H "Authorization: Bearer edg_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-experiment",
    "model_id": "radiology-v1",
    "control_version": "1.0.0",
    "treatment_version": "2.0.0",
    "traffic_split": 50
  }'
```
## Manage traffic
Adjust the split after launch. Increase treatment traffic as confidence grows, or reduce it if metrics regress.
```shell
# Shift more traffic to treatment
octomil experiment update my-experiment --traffic-split 80

# Pause the experiment
octomil experiment pause my-experiment

# Resume
octomil experiment resume my-experiment
```

```python
client.experiments.update_split("exp_a1b2c3d4", traffic_split=80)
client.experiments.pause("exp_a1b2c3d4")
client.experiments.resume("exp_a1b2c3d4")
```
## Target device groups
Constrain an experiment to specific device cohorts instead of the full fleet:
```shell
octomil experiment create staging-test \
  --model radiology-v1 \
  --control 1.0.0 \
  --treatment 2.0.0 \
  --traffic-split 50 \
  --group staging-devices
```
## View results
```shell
octomil experiment results my-experiment
```

```
Experiment: my-experiment (running, 3 days)
Model: radiology-v1

            Control (1.0.0)   Treatment (2.0.0)
Devices     124               118
Accuracy    0.871             0.894
Latency     42ms              39ms
Error rate  0.3%              0.2%

Statistical significance: p=0.023 (significant)
Recommendation: promote treatment
```
```python
results = client.experiments.results("exp_a1b2c3d4")
print(f"Control accuracy: {results.control.accuracy}")
print(f"Treatment accuracy: {results.treatment.accuracy}")
print(f"Significant: {results.is_significant}")
```
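When scripting against results, a common pattern is to gate promotion on both significance and the primary metric. The helper below is a minimal sketch using stand-in types that mirror the result fields shown above (`control.accuracy`, `treatment.accuracy`, `is_significant`); the stand-ins and the `decide` helper are illustrative, not part of the SDK.

```python
from dataclasses import dataclass

# Stand-in types mirroring the result fields shown above; the real
# objects come from client.experiments.results(...).
@dataclass
class Arm:
    accuracy: float

@dataclass
class ExperimentResults:
    control: Arm
    treatment: Arm
    is_significant: bool

def decide(results: ExperimentResults) -> str:
    """Return 'promote', 'stop', or 'wait'.

    Promote only when the difference is statistically significant
    AND the treatment actually beats the control on the primary
    metric; a significant regression should stop the experiment.
    """
    if not results.is_significant:
        return "wait"  # directional only; keep collecting samples
    if results.treatment.accuracy > results.control.accuracy:
        return "promote"
    return "stop"

# With the numbers from the results output above:
print(decide(ExperimentResults(Arm(0.871), Arm(0.894), True)))  # promote
```

If `--auto-promote` is enabled, the platform applies this kind of gating server-side; the exact criteria it uses are not documented here.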
## Promote or stop
```shell
# Promote treatment → starts a full rollout of the winning version
octomil experiment promote my-experiment

# Stop → keeps control version, discards experiment
octomil experiment stop my-experiment
```
## Options
| Option | Default | Description |
|---|---|---|
| `--traffic-split` | `50` | Percentage of traffic sent to treatment (the rest goes to control) |
| `--group` | all devices | Target a specific device group |
| `--metric` | `accuracy` | Primary metric for comparison |
| `--min-samples` | `100` | Minimum samples before the significance test runs |
| `--auto-promote` | off | Automatically promote treatment if significant and better |
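These flags compose on `experiment create`. For example, to compare on latency, require more samples, and let the platform promote automatically (the experiment name and values below are illustrative, and whether `--auto-promote` takes a value or acts as a boolean switch may differ; check `octomil experiment create --help`):

```shell
octomil experiment create latency-test \
  --model radiology-v1 \
  --control 1.0.0 \
  --treatment 2.0.0 \
  --traffic-split 50 \
  --metric latency \
  --min-samples 500 \
  --auto-promote
```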
## Gotchas
- Traffic splits are bounded by rollout enforcement — if a rollout is already active for the same model, the experiment's traffic comes from the rollout's allocated percentage. Coordinate traffic allocation between experiments and rollouts to avoid conflicts.
- Minimum sample size matters — results shown before `--min-samples` is reached are directional, not statistically significant. Don't act on early results.
- Device assignment is sticky — a device stays in its assigned arm (control or treatment) for the duration of the experiment. Changing the split only affects newly assigned devices.
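Sticky assignment can be pictured as an assign-once lookup: the first time a device checks in, it is bucketed against the split in force at that moment, and the result is recorded. The sketch below illustrates that behavior with stdlib hashing; it is not the platform's actual implementation.

```python
import hashlib

class StickyAssigner:
    """Assign-once bucketing: a device keeps its first arm even if
    the traffic split later changes (illustrative only)."""

    def __init__(self, treatment_pct: int):
        self.treatment_pct = treatment_pct
        self._assignments: dict[str, str] = {}

    def arm_for(self, device_id: str) -> str:
        # Return the recorded arm if the device was seen before.
        if device_id in self._assignments:
            return self._assignments[device_id]
        # Otherwise bucket deterministically: hash the id into 0-99
        # and compare against the split in force right now.
        bucket = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 100
        arm = "treatment" if bucket < self.treatment_pct else "control"
        self._assignments[device_id] = arm
        return arm

assigner = StickyAssigner(treatment_pct=50)
first = assigner.arm_for("device-042")
assigner.treatment_pct = 80                     # split raised mid-experiment
assert assigner.arm_for("device-042") == first  # existing device unaffected
```

Because the arm is recorded on first contact, raising the split only changes the odds for devices that have not yet been assigned.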
## Related
- Rollouts — deploy the winning version after an experiment
- Device Groups — target experiments to specific cohorts
- Monitoring — track experiment metrics in real time