Experiments

Compare model versions on real traffic before committing to a full rollout.

When to use

  • You trained a new model version and want to validate it against production
  • You're testing different quantizations or model sizes on the same fleet
  • You need statistical confidence before a full deployment

Quick start

octomil experiment create my-experiment \
--model radiology-v1 \
--control 1.0.0 \
--treatment 2.0.0 \
--traffic-split 50

Experiment created: exp_a1b2c3d4
Model: radiology-v1
Control: 1.0.0 (50%)
Treatment: 2.0.0 (50%)
Status: running

Manage traffic

Adjust the split after launch. Increase treatment traffic as confidence grows, or reduce it if metrics regress.

# Shift more traffic to treatment
octomil experiment update my-experiment --traffic-split 80

# Pause the experiment
octomil experiment pause my-experiment

# Resume
octomil experiment resume my-experiment
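A typical ramp uses only the commands above, increasing the split in steps and backing off on regression (the split values here are illustrative, not recommendations):

```shell
# Ramp treatment as metrics hold up
octomil experiment update my-experiment --traffic-split 60
octomil experiment update my-experiment --traffic-split 80

# If metrics regress, pull treatment back while you investigate
octomil experiment update my-experiment --traffic-split 20
```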

Target device groups

Constrain an experiment to specific device cohorts instead of the full fleet:

octomil experiment create staging-test \
--model radiology-v1 \
--control 1.0.0 \
--treatment 2.0.0 \
--traffic-split 50 \
--group staging-devices

View results

octomil experiment results my-experiment

Experiment: my-experiment (running, 3 days)
Model: radiology-v1

             Control (1.0.0)   Treatment (2.0.0)
Devices      124               118
Accuracy     0.871             0.894
Latency      42ms              39ms
Error rate   0.3%              0.2%

Statistical significance: p=0.023 (significant)
Recommendation: promote treatment

Promote or stop

# Promote treatment → starts a full rollout of the winning version
octomil experiment promote my-experiment

# Stop → keeps control version, discards experiment
octomil experiment stop my-experiment

Options

Option            Default       Description
--traffic-split   50            Percentage of traffic sent to treatment (the rest goes to control)
--group           all devices   Target a specific device group
--metric          accuracy      Primary metric for comparison
--min-samples     100           Minimum samples before the significance test runs
--auto-promote    off           Automatically promote treatment if the result is significant and treatment is better
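These options compose on create. A sketch combining them (the flag values are illustrative, and passing --auto-promote as a bare flag is an assumption about its syntax):

```shell
# Compare on latency instead of accuracy, require a larger sample,
# and let the platform promote treatment if it wins significantly.
# (Values are illustrative; --auto-promote is assumed to be a bare flag.)
octomil experiment create latency-check \
--model radiology-v1 \
--control 1.0.0 \
--treatment 2.0.0 \
--traffic-split 50 \
--metric latency \
--min-samples 500 \
--auto-promote
```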

Gotchas

  • Traffic splits are bounded by rollout enforcement — if a rollout is already active for the same model, the experiment's traffic comes from the rollout's allocated percentage. Coordinate traffic allocation between experiments and rollouts to avoid conflicts.
  • Minimum sample size matters — results shown before --min-samples is reached are directional, not statistically significant. Don't act on early results.
  • Device assignment is sticky — a device stays in its assigned group (control or treatment) for the duration of the experiment. Changing the split only affects newly assigned devices.
Related

  • Rollouts — deploy the winning version after an experiment
  • Device Groups — target experiments to specific cohorts
  • Monitoring — track experiment metrics in real time