Jobs Dashboard

The Jobs page (/jobs) is the client-facing control plane for asynchronous operations, scheduled automation, and failure recovery.

Job types

Octomil tracks four categories of asynchronous work:

Job type	Description	Typical duration
Training round	One federated learning round (select devices, collect updates, aggregate)	1–30 min
Model conversion	Convert a model between formats (PyTorch to ONNX to CoreML/TFLite)	30 s–5 min
Deployment rollout	Progressive rollout of a model version to devices	Minutes to hours
Scheduled task	Recurring maintenance (cache cleanup, metric aggregation, health checks)	Seconds

Job lifecycle

Every job follows the same state machine:

queued → running → completed
                 → failed → (retry) → queued
         → cancelled

Status	Description
`queued`	Waiting to execute. Jobs run in priority order.
`running`	Actively executing. Progress percentage shown when available.
`completed`	Finished successfully. Results stored for inspection.
`failed`	Execution failed. Error message and stack trace available. Retryable.
`cancelled`	Cancelled manually by a user, or automatically because of a dependent job failure.
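
The lifecycle above can be read as a small transition table. The following is a minimal sketch in Python, not Octomil's implementation: the names `JobStatus`, `ALLOWED`, and `can_transition` are hypothetical, and the allowed transitions are inferred from the diagram and the status descriptions.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumed transition table, inferred from the lifecycle diagram.
ALLOWED = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.CANCELLED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED},
    JobStatus.FAILED: {JobStatus.QUEUED},  # a retry re-queues the job
    JobStatus.COMPLETED: set(),            # terminal
    JobStatus.CANCELLED: set(),            # terminal
}


def can_transition(current: JobStatus, target: JobStatus) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in ALLOWED[current]
```

Under these assumptions, `can_transition(JobStatus.FAILED, JobStatus.QUEUED)` is true (a retry), while a completed job cannot move to any other status.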

Managing jobs

Viewing job details

Click any job row to see:

Progress — percentage complete (for training rounds: devices reported / devices selected)
Attempts — how many times the job has been tried, with timestamps
Error text — for failed jobs, the error message and context
Duration — wall-clock time from start to completion
Related resources — links to the model, version, or round involved
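
As an illustration of how the Progress and Duration fields could be derived, here is a minimal Python sketch. The function names and the zero-device handling are assumptions made for this example; the dashboard may compute these values differently.

```python
from datetime import datetime


def round_progress(devices_reported: int, devices_selected: int) -> float:
    """Percentage of selected devices that have reported, from 0 to 100."""
    if devices_selected <= 0:
        # Assumption: a round with no selected devices shows 0% progress.
        return 0.0
    return min(100.0, 100.0 * devices_reported / devices_selected)


def wall_clock_seconds(started_at: datetime, finished_at: datetime) -> float:
    """Wall-clock duration from start to completion, in seconds."""
    return (finished_at - started_at).total_seconds()


progress = round_progress(devices_reported=3, devices_selected=12)
duration = wall_clock_seconds(
    datetime(2024, 1, 1, 12, 0, 0),
    datetime(2024, 1, 1, 12, 5, 30),
)
```

With 3 of 12 devices reported, `progress` is 25.0; the example timestamps are 5 minutes 30 seconds apart, so `duration` is 330.0.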

Retry and cancel

Retry: Click the retry button on a failed job. The job re-enters queued with its attempt counter incremented.
Cancel: Click cancel on a queued or running job. Running jobs are interrupted gracefully.

Both actions are recorded in the Audit Trail.
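
The retry and cancel rules can be sketched as two small functions. This is an illustrative Python model only: the `Job` class, its fields, and the in-memory `audit_log` are hypothetical stand-ins, not the Octomil API or the actual Audit Trail format.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    status: str = "queued"
    attempts: int = 1
    audit_log: list = field(default_factory=list)  # stand-in for the Audit Trail


def retry(job: Job, user: str) -> None:
    """Re-queue a failed job and increment its attempt counter."""
    if job.status != "failed":
        raise ValueError("only failed jobs can be retried")
    job.status = "queued"
    job.attempts += 1
    job.audit_log.append(("retry", user))


def cancel(job: Job, user: str) -> None:
    """Cancel a job that is still queued or running."""
    if job.status not in ("queued", "running"):
        raise ValueError("only queued or running jobs can be cancelled")
    job.status = "cancelled"
    job.audit_log.append(("cancel", user))
```

In this sketch, retrying a job that is not `failed`, or cancelling one that has already finished, raises a `ValueError` rather than silently changing state.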

Training round jobs

When you start federated training (via the SDK or dashboard), each round creates a job:

Device selection — the server selects eligible devices based on availability, battery, and network status.
Update collection — selected devices train locally and upload model updates. Progress reflects how many devices have reported.
Aggregation — once min_updates is reached, the server aggregates updates using the configured strategy (FedAvg, FedProx, Ditto, etc.).
Version creation — the aggregated model is saved as a new version.

If too few devices report within the round timeout, the job fails with an "insufficient updates" error. You can retry the round or adjust min_updates.
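
To make the min_updates check and the aggregation step concrete, here is a minimal Python sketch. It assumes FedAvg as the strategy, weighting each update by the device's example count, and it represents model weights as flat lists of floats. `InsufficientUpdatesError`, `fedavg`, and `finish_round` are hypothetical names for this example; the server's actual aggregation code may differ.

```python
class InsufficientUpdatesError(Exception):
    """Raised when fewer than min_updates devices reported in time."""


def fedavg(updates):
    """Weighted average of updates given as (weights, num_examples) pairs."""
    total_examples = sum(n for _, n in updates)
    if total_examples <= 0:
        raise ValueError("updates must contain at least one training example")
    dim = len(updates[0][0])
    return [
        sum(weights[i] * n for weights, n in updates) / total_examples
        for i in range(dim)
    ]


def finish_round(updates, min_updates):
    """Aggregate the round if enough devices reported, otherwise fail."""
    if len(updates) < min_updates:
        raise InsufficientUpdatesError("insufficient updates")
    return fedavg(updates)


# Two devices report: one with 1 example, one with 3 examples.
aggregated = finish_round([([1.0, 2.0], 1), ([3.0, 4.0], 3)], min_updates=2)
```

Here `aggregated` is `[2.5, 3.5]`, because the second device's update carries three times the weight of the first. Calling `finish_round` with the same updates and `min_updates=3` raises `InsufficientUpdatesError`, which corresponds to the failed-round case described above.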

Model conversion jobs

Model conversion now runs locally via the CLI rather than as a server-side job. Use octomil convert to convert models on your machine before uploading:

# Convert locally
octomil convert model.pt --formats onnx,coreml,tflite --output converted_models

# Convert and push in one step
octomil convert model.pt --formats onnx,coreml,tflite --push --model-id my-model --version 1.0.0

The jobs dashboard still tracks conversion-related upload jobs. If a conversion fails locally (e.g., unsupported ONNX op for TFLite), fix the issue and re-run octomil convert for the failed format.

Filtering and search

Filter the jobs list by:

Type — training, conversion, deployment, scheduled
Status — queued, running, completed, failed, cancelled
Time range — last hour, last 24h, last 7 days, custom
Model — filter by model name or ID
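
The filters combine with AND semantics in the following sketch, which shows one way such filtering could work on a list of job records. The dictionary keys (`type`, `status`, `created_at`, `model_id`, `model_name`) and the `filter_jobs` function are assumptions made for illustration, not the dashboard's data model.

```python
from datetime import datetime, timedelta


def filter_jobs(jobs, job_type=None, status=None, since=None, model=None):
    """Return the jobs that match every filter that was provided."""
    matches = []
    for job in jobs:
        if job_type is not None and job["type"] != job_type:
            continue
        if status is not None and job["status"] != status:
            continue
        if since is not None and job["created_at"] < since:
            continue
        if model is not None and model not in (job["model_id"], job["model_name"]):
            continue
        matches.append(job)
    return matches


now = datetime(2024, 1, 2, 12, 0, 0)
jobs = [
    {"id": "job-1", "type": "training", "status": "completed",
     "created_at": now - timedelta(hours=2), "model_id": "m1", "model_name": "my-model"},
    {"id": "job-2", "type": "training", "status": "failed",
     "created_at": now - timedelta(hours=1), "model_id": "m1", "model_name": "my-model"},
    {"id": "job-3", "type": "deployment", "status": "failed",
     "created_at": now - timedelta(days=3), "model_id": "m2", "model_name": "other-model"},
]

# Failed jobs from the last 24 hours.
recent_failures = filter_jobs(jobs, status="failed", since=now - timedelta(hours=24))
```

With the sample data, `recent_failures` contains only `job-2`: `job-1` did not fail, and `job-3` falls outside the 24-hour window.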
