
Federated Training

Run training across devices from multiple organizations without centralizing data.

How it works

  1. App developer configures the SDK on each device (done once)
  2. Federation owner starts the training run from the CLI
  3. Each round: server selects devices, devices train locally, devices send weight updates, server aggregates
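The round loop in step 3 can be sketched end to end as a toy simulation. Here each "device" takes one local gradient step toward the mean of its own data and the server averages the results; the names (`local_step`, `run_federation`) and the scalar model are illustrative only, not the Octomil protocol:

```python
def local_step(weight, local_data, lr=0.5):
    # One gradient step of mean-squared error toward the local mean.
    grad = sum(weight - x for x in local_data) / len(local_data)
    return weight - lr * grad

def run_federation(device_datasets, rounds=20):
    weight = 0.0  # the server's global model: a single scalar
    for _ in range(rounds):
        # Each device trains locally on data that never leaves it...
        updates = [local_step(weight, data) for data in device_datasets]
        # ...and the server aggregates only the resulting updates.
        weight = sum(updates) / len(updates)
    return weight

# Two devices with different local data; the global model converges
# to the mean across all devices without pooling the raw samples.
print(round(run_federation([[1.0, 2.0], [3.0, 4.0]]), 3))  # 2.5
```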

Configure devices

Install the SDK and register each device with the Octomil server. This runs on the device itself:

from octomil import FederatedClient

client = FederatedClient(api_key="edg_...", device_identifier="hospital-001")
client.register()

Once registered, the device sits idle until the server starts a training round.

Start training

The federation owner triggers the run:

octomil train start radiology-v1 \
--strategy fedavg \
--rounds 50 \
--group production

Each round:

  1. Server selects participating devices from all active member organizations
  2. Server sends current model weights to those devices
  3. Each device trains locally on its own data
  4. Each device sends back weight updates, never raw data
  5. Server aggregates updates into an improved model
  6. Improved model is sent to devices for the next round

Send weight updates

When a training round starts, the SDK trains on local data and submits the updated weights automatically:

from octomil import FederatedClient

client = FederatedClient(api_key="edg_...", device_identifier="hospital-001")
client.register()

# Option 1: Point to local data — SDK handles training and upload
client.train(model="radiology-v1", data="/data/patients.csv", target_col="diagnosis")

# Option 2: Full control — bring your own training loop
def train_locally(base_state_dict):
    model = MyModel()
    model.load_state_dict(base_state_dict)
    train_one_epoch(model, local_dataloader)
    return model.state_dict(), len(local_data), {"loss": 0.42}

client.train_from_remote(model="radiology-v1", local_train_fn=train_locally, rounds=5)

Monitor training

octomil train status radiology-v1
Training: radiology-v1 (fed_a1b2c3d4)
Strategy: fedavg
Round: 23 / 50 Status: IN_PROGRESS

Contributions this round:
org_7f8a9b0c Acme Health 12 devices updates: 12/12
org_abc123 Metro General 8 devices updates: 6/8

Overall accuracy: 0.874 (improving)

Run inference

After training completes, each device downloads the trained model and runs predictions locally — no cloud calls:

# Python uses the CLI for local inference
# Start the inference server:
# octomil serve radiology-v1

import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="radiology-v1",
messages=[{"role": "user", "content": "Classify this scan."}],
)
print(response.choices[0].message.content)

Rules

  • Raw data never leaves the device
  • Only active federation members can submit updates
  • Each organization controls its own differential privacy budget: octomil team set-policy --privacy-budget <epsilon>
  • Federation-scoped aggregation: updates from member devices only
  • Per-organization contribution tracked per round
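The privacy budget epsilon bounds how much any one device's data can influence the shared model. As an illustration of the underlying mechanism — not the SDK's actual implementation — clipping an update and adding Gaussian noise calibrated to the budget looks roughly like this (`privatize_update` and its parameters are hypothetical):

```python
import math
import random

def privatize_update(update, clip_norm=1.0, epsilon=1.0, delta=1e-5):
    """Clip an update to `clip_norm`, then add Gaussian noise sized for
    (epsilon, delta)-differential privacy. Illustrative sketch only."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [v * scale for v in update]
    # Standard Gaussian-mechanism noise multiplier: smaller epsilon
    # (a tighter budget) means more noise per update.
    sigma = clip_norm * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return [v + random.gauss(0, sigma) for v in clipped]

noisy = privatize_update([3.0, 4.0], clip_norm=1.0, epsilon=1.0)
```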

Deploy the trained model

When training completes, the server saves aggregated weights as a new version:

Training complete: radiology-v1
New version: 2.0.0 (aggregated from 50 rounds, 2 organizations)

Each organization deploys independently:

octomil deploy radiology-v1 --version 2.0.0 --rollout 10%
# validate metrics
octomil deploy radiology-v1 --version 2.0.0 --rollout 100%
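The "validate metrics" step between the 10% and 100% commands can be scripted. A hypothetical promotion gate (the function name, metric choice, and threshold are illustrative, not part of the Octomil CLI):

```python
def should_promote(canary_error_rate, baseline_error_rate, max_regression=0.01):
    """Promote the canary only if its error rate regresses no more than
    `max_regression` (absolute) over the current baseline version."""
    return canary_error_rate <= baseline_error_rate + max_regression

# Canary at 2.4% errors vs. a 2.0% baseline: within the 1-point budget.
print(should_promote(0.024, 0.020))  # True
```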

Consider an A/B experiment comparing v1.0.0 against v2.0.0 before full rollout.

If auto-rollback is enabled, deployments exceeding the error threshold roll back automatically:

octomil rollback radiology-v1 --to-version 1.0.0