Experiments

Compare model versions on real traffic before committing to a full rollout.

When to use

  • You trained a new model version and want to validate it against production
  • You're testing different quantizations or model sizes on the same fleet
  • You need statistical confidence before a full deployment

Quick start

octomil experiment create my-experiment \
--model radiology-v1 \
--control 1.0.0 \
--treatment 2.0.0 \
--traffic-split 50

Experiment created: exp_a1b2c3d4
Model: radiology-v1
Control: 1.0.0 (50%)
Treatment: 2.0.0 (50%)
Status: running

Manage traffic

Adjust the split after launch. Increase treatment traffic as confidence grows, or reduce it if metrics regress.

# Shift more traffic to treatment
octomil experiment update my-experiment --traffic-split 80

# Pause the experiment
octomil experiment pause my-experiment

# Resume
octomil experiment resume my-experiment
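A typical ramp uses only the commands above, increasing the split in steps and backing off on regression (the split values here are illustrative, not recommendations):

```shell
# Ramp treatment as metrics hold up
octomil experiment update my-experiment --traffic-split 60
octomil experiment update my-experiment --traffic-split 80

# If metrics regress, pull treatment back while you investigate
octomil experiment update my-experiment --traffic-split 20
```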

Target device groups

Constrain an experiment to specific device cohorts instead of the full fleet:

octomil experiment create staging-test \
--model radiology-v1 \
--control 1.0.0 \
--treatment 2.0.0 \
--traffic-split 50 \
--group staging-devices

View results

octomil experiment results my-experiment

Experiment: my-experiment (running, 3 days)
Model: radiology-v1

             Control (1.0.0)   Treatment (2.0.0)
Devices      124               118
Accuracy     0.871             0.894
Latency      42ms              39ms
Error rate   0.3%              0.2%

Statistical significance: p=0.023 (significant)
Recommendation: promote treatment

Promote or stop

# Promote treatment → starts a full rollout of the winning version
octomil experiment promote my-experiment

# Stop → keeps control version, discards experiment
octomil experiment stop my-experiment

Options

Option            Default       Description
--traffic-split   50            Percentage of traffic sent to treatment (the rest goes to control)
--group           all devices   Target a specific device group
--metric          accuracy      Primary metric for comparison
--min-samples     100           Minimum samples before the significance test runs
--auto-promote    off           Automatically promote treatment if the result is significant and treatment is better
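These options compose on create. A sketch combining them (the flag values are illustrative, and passing --auto-promote as a bare flag is an assumption about its syntax):

```shell
# Compare on latency instead of accuracy, require a larger sample,
# and let the platform promote treatment if it wins significantly.
# (Values are illustrative; --auto-promote is assumed to be a bare flag.)
octomil experiment create latency-check \
--model radiology-v1 \
--control 1.0.0 \
--treatment 2.0.0 \
--traffic-split 50 \
--metric latency \
--min-samples 500 \
--auto-promote
```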

Gotchas

  • Traffic splits are bounded by rollout enforcement — if a rollout is already active for the same model, the experiment's traffic comes from the rollout's allocated percentage. Coordinate traffic allocation between experiments and rollouts to avoid conflicts.
  • Minimum sample size matters — results shown before --min-samples is reached are directional, not statistically significant. Don't act on early results.
  • Device assignment is sticky — a device stays in its assigned group (control or treatment) for the duration of the experiment. Changing the split only affects newly assigned devices.
Related

  • Rollouts — deploy the winning version after an experiment
  • Device Groups — target experiments to specific cohorts
  • Monitoring — track experiment metrics in real time