Jobs Dashboard
The Jobs page (/jobs) is the client-facing control plane for asynchronous operations, scheduled automation, and failure recovery.
Job types
Octomil tracks four categories of asynchronous work:
| Job type | Description | Typical duration |
|---|---|---|
| Training round | One federated learning round (select devices, collect updates, aggregate) | 1--30 min |
| Model conversion | Convert a model between formats (PyTorch to ONNX to CoreML/TFLite) | 30s--5 min |
| Deployment rollout | Progressive rollout of a model version to devices | Minutes to hours |
| Scheduled task | Recurring maintenance (cache cleanup, metric aggregation, health checks) | Seconds |
Job lifecycle
Every job follows the same state machine:
queued → running → completed
running → failed → (retry) → queued
queued or running → cancelled
| Status | Description |
|---|---|
| queued | Waiting to execute. Jobs run in priority order. |
| running | Actively executing. Progress percentage shown when available. |
| completed | Finished successfully. Results stored for inspection. |
| failed | Execution failed. Error message and stack trace available. Retryable. |
| cancelled | Manually cancelled by a user or by a dependent job failure. |
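The lifecycle above can be read as a small transition table. The following is a minimal sketch of that reading, not Octomil's implementation: the `JobStatus`, `ALLOWED`, and `Job` names are hypothetical, and starting the attempt counter at 1 is an assumption.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed transitions, read off the lifecycle diagram and status table.
ALLOWED = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.CANCELLED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED},
    JobStatus.FAILED: {JobStatus.QUEUED},  # retry puts the job back in the queue
    JobStatus.COMPLETED: set(),            # terminal
    JobStatus.CANCELLED: set(),            # terminal
}


class Job:
    """Hypothetical in-memory job used only to illustrate the state machine."""

    def __init__(self) -> None:
        self.status = JobStatus.QUEUED
        self.attempts = 1  # assumption: the first run counts as attempt 1

    def move_to(self, target: JobStatus) -> None:
        if target not in ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition: {self.status.value} -> {target.value}"
            )
        # A retry (failed -> queued) increments the attempt counter.
        if self.status is JobStatus.FAILED and target is JobStatus.QUEUED:
            self.attempts += 1
        self.status = target
```

Keeping the transitions in one dictionary makes the terminal states explicit: `completed` and `cancelled` have no outgoing edges, so only a `failed` job can re-enter the queue.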
Managing jobs
Viewing job details
Click any job row to see:
- Progress — percentage complete (for training rounds: devices reported / devices selected)
- Attempts — how many times the job has been tried, with timestamps
- Error text — for failed jobs, the error message and context
- Duration — wall-clock time from start to completion
- Related resources — links to the model, version, or round involved
Retry and cancel
- Retry: Click the retry button on a failed job. The job re-enters queued with its attempt counter incremented.
- Cancel: Click cancel on a queued or running job. Running jobs are interrupted gracefully.
Both actions are recorded in the Audit Trail.
Training round jobs
When you start federated training (via the SDK or dashboard), each round creates a job:
- Device selection — the server selects eligible devices based on availability, battery, and network status.
- Update collection — selected devices train locally and upload model updates. Progress reflects how many devices have reported.
- Aggregation — once min_updates is reached, the server aggregates updates using the configured strategy (FedAvg, FedProx, Ditto, etc.).
- Version creation — the aggregated model is saved as a new version.
If too few devices report within the round timeout, the job fails with an "insufficient updates" error. You can retry the round or lower min_updates.
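To make the round logic concrete, here is a rough sketch of the progress calculation, the min_updates check, and a sample-weighted FedAvg aggregation. It is illustrative only: `DeviceUpdate`, `round_progress`, `fedavg`, and `finish_round` are hypothetical names, model weights are simplified to a flat list of floats, and the server's actual aggregation code may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DeviceUpdate:
    weights: List[float]  # flattened model parameters (a simplification)
    num_samples: int      # local training examples behind this update


def round_progress(reported: int, selected: int) -> float:
    """Progress in percent: devices reported / devices selected."""
    if selected <= 0:
        return 0.0
    return 100.0 * reported / selected


def fedavg(updates: Sequence[DeviceUpdate]) -> List[float]:
    """Average the updates, weighting each device by its sample count."""
    total = sum(u.num_samples for u in updates)
    if total <= 0:
        raise ValueError("no training samples to weight by")
    dim = len(updates[0].weights)
    aggregated = [0.0] * dim
    for u in updates:
        if len(u.weights) != dim:
            raise ValueError("updates have mismatched shapes")
        for i, w in enumerate(u.weights):
            aggregated[i] += w * u.num_samples / total
    return aggregated


def finish_round(updates: Sequence[DeviceUpdate], min_updates: int) -> List[float]:
    """Aggregate if enough devices reported; otherwise fail the round."""
    if len(updates) < min_updates:
        raise RuntimeError("insufficient updates")
    return fedavg(updates)
```

In this sketch, a round with two reports and min_updates set to 3 raises the "insufficient updates" error, which corresponds to the failed job you would retry from the dashboard.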
Model conversion jobs
Model conversion now runs locally via the CLI rather than as a server-side job. Use octomil convert to convert models on your machine before uploading:
# Convert locally
octomil convert model.pt --formats onnx,coreml,tflite --output converted_models
# Convert and push in one step
octomil convert model.pt --formats onnx,coreml,tflite --push --model-id my-model --version 1.0.0
The jobs dashboard still tracks conversion-related upload jobs. If a conversion fails locally (e.g., unsupported ONNX op for TFLite), fix the issue and re-run octomil convert for the failed format.
Filtering and search
Filter the jobs list by:
- Type — training, conversion, deployment, scheduled
- Status — queued, running, completed, failed, cancelled
- Time range — last hour, last 24h, last 7 days, custom
- Model — filter by model name or ID
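The filters combine as a logical AND: a job is listed only if it matches every filter that is set. A minimal sketch of that behaviour over in-memory records follows; the `JobRecord` fields and the `filter_jobs` function are hypothetical and do not describe the dashboard's real data model or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class JobRecord:
    job_type: str         # "training", "conversion", "deployment", "scheduled"
    status: str           # "queued", "running", "completed", "failed", "cancelled"
    created_at: datetime
    model: str            # model name or ID


def filter_jobs(
    jobs: Iterable[JobRecord],
    job_type: Optional[str] = None,
    status: Optional[str] = None,
    since: Optional[datetime] = None,
    model: Optional[str] = None,
) -> List[JobRecord]:
    """Return jobs matching every filter that is not None."""
    matched = []
    for job in jobs:
        if job_type is not None and job.job_type != job_type:
            continue
        if status is not None and job.status != status:
            continue
        if since is not None and job.created_at < since:
            continue
        if model is not None and job.model != model:
            continue
        matched.append(job)
    return matched
```

A "last 24h" time range would be expressed here by passing `since` as the current time minus `timedelta(hours=24)`.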