Jobs Dashboard

The Jobs page (/jobs) is the client-facing control plane for asynchronous operations, scheduled automation, and failure recovery.

Job types

Octomil tracks four categories of asynchronous work:

| Job type | Description | Typical duration |
|---|---|---|
| Training round | One federated learning round (select devices, collect updates, aggregate) | 1–30 min |
| Model conversion | Convert a model between formats (PyTorch to ONNX to CoreML/TFLite) | 30 s–5 min |
| Deployment rollout | Progressive rollout of a model version to devices | Minutes to hours |
| Scheduled task | Recurring maintenance (cache cleanup, metric aggregation, health checks) | Seconds |

Job lifecycle

Every job follows the same state machine:

queued → running → completed
                 → failed → (retry) → queued
                 → cancelled

| Status | Description |
|---|---|
| queued | Waiting to execute. Jobs run in priority order. |
| running | Actively executing. Progress percentage shown when available. |
| completed | Finished successfully. Results stored for inspection. |
| failed | Execution failed. Error message and stack trace available. Retryable. |
| cancelled | Manually cancelled by a user or by a dependent job failure. |
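
The lifecycle above can be modelled as a small transition table. The following is a minimal sketch, not Octomil's implementation: the `Status` enum, `ALLOWED` map, and helper names are hypothetical, and it assumes a queued job can be cancelled before it starts (the Cancel action applies to queued and running jobs).

```python
from enum import Enum


class Status(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed transitions, read off the state diagram and the status table.
# Assumption: queued -> cancelled is allowed, because cancel can be
# clicked on a queued job as well as a running one.
ALLOWED = {
    Status.QUEUED: {Status.RUNNING, Status.CANCELLED},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED, Status.CANCELLED},
    Status.FAILED: {Status.QUEUED},  # retry puts the job back in the queue
    Status.COMPLETED: set(),
    Status.CANCELLED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if the lifecycle permits moving from current to target."""
    return target in ALLOWED[current]


def is_terminal(status: Status) -> bool:
    """A status is terminal when no further transition leaves it."""
    return not ALLOWED[status]
```

In this sketch `failed` is not terminal, because a retry moves the job back to `queued`, while `completed` and `cancelled` are.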

Managing jobs

Viewing job details

Click any job row to see:

  • Progress — percentage complete (for training rounds: devices reported / devices selected)
  • Attempts — how many times the job has been tried, with timestamps
  • Error text — for failed jobs, the error message and context
  • Duration — wall-clock time from start to completion
  • Related resources — links to the model, version, or round involved
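
As a rough illustration of how the progress, attempts, and duration fields relate to one another, here is a sketch of a job-details record for a training round. All class and field names are hypothetical, and it assumes duration is measured from the start of the most recent attempt; the dashboard may measure it differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class JobDetails:
    # Hypothetical fields mirroring what the job detail view shows.
    devices_selected: int
    devices_reported: int
    attempt_started_at: List[datetime] = field(default_factory=list)
    finished_at: Optional[datetime] = None

    def progress_percent(self) -> float:
        """Training-round progress: devices reported / devices selected."""
        if self.devices_selected == 0:
            return 0.0
        return 100.0 * self.devices_reported / self.devices_selected

    def attempts(self) -> int:
        """Number of times the job has been tried (one timestamp each)."""
        return len(self.attempt_started_at)

    def duration(self) -> Optional[timedelta]:
        """Wall-clock time of the latest attempt, or None if unfinished.

        Assumption: duration covers the most recent attempt only.
        """
        if not self.attempt_started_at or self.finished_at is None:
            return None
        return self.finished_at - self.attempt_started_at[-1]
```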

Retry and cancel

  • Retry: Click the retry button on a failed job. The job re-enters queued with its attempt counter incremented.
  • Cancel: Click cancel on a queued or running job. Running jobs are interrupted gracefully.

Both actions are recorded in the Audit Trail.
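
The retry and cancel rules can be sketched as two guarded methods that also append an audit entry. This is an illustrative model only: the `Job` class, the audit tuple layout, and the choice to start the attempt counter at 1 are assumptions, not the dashboard's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Job:
    job_id: str
    status: str = "queued"
    attempts: int = 1  # assumption: the first run counts as attempt 1
    # Hypothetical audit entries: (user, action, job_id)
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def retry(self, user: str) -> None:
        """Only failed jobs can be retried; the job re-enters the queue."""
        if self.status != "failed":
            raise ValueError(f"cannot retry a {self.status} job")
        self.status = "queued"
        self.attempts += 1
        self.audit_log.append((user, "retry", self.job_id))

    def cancel(self, user: str) -> None:
        """Only queued or running jobs can be cancelled."""
        if self.status not in ("queued", "running"):
            raise ValueError(f"cannot cancel a {self.status} job")
        self.status = "cancelled"
        self.audit_log.append((user, "cancel", self.job_id))
```

Rejecting the action before touching any state means the audit log only records actions that actually took effect.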

Training round jobs

When you start federated training (via the SDK or dashboard), each round creates a job:

  1. Device selection — the server selects eligible devices based on availability, battery, and network status.
  2. Update collection — selected devices train locally and upload model updates. Progress reflects how many devices have reported.
  3. Aggregation — once min_updates is reached, the server aggregates updates using the configured strategy (FedAvg, FedProx, Ditto, etc.).
  4. Version creation — the aggregated model is saved as a new version.

If too few devices report within the round timeout, the job fails with an "insufficient updates" error. You can retry the round or lower min_updates.
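
To make the aggregation step and the failure condition concrete, here is a minimal sketch of closing a round with FedAvg, which averages the device updates weighted by each device's sample count. The function names, the `(weights, num_samples)` update shape, and the exception class are hypothetical; the server's real aggregation code and the other strategies (FedProx, Ditto) are not shown.

```python
from typing import List, Sequence, Tuple

# Hypothetical update shape: (flat list of model weights, local sample count)
Update = Tuple[Sequence[float], int]


class InsufficientUpdatesError(RuntimeError):
    """Raised when fewer than min_updates devices reported in time."""


def fedavg(updates: List[Update]) -> List[float]:
    """Sample-weighted average of the reported weight vectors.

    Assumes every update has the same length and a positive sample count.
    """
    total_samples = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [
        sum(weights[i] * n for weights, n in updates) / total_samples
        for i in range(dim)
    ]


def close_round(updates: List[Update], min_updates: int) -> List[float]:
    """Aggregate if enough devices reported, otherwise fail the round."""
    if len(updates) < min_updates:
        raise InsufficientUpdatesError("insufficient updates")
    return fedavg(updates)
```

With two updates `([1.0, 2.0], 1)` and `([3.0, 4.0], 3)`, the second device contributes three times the weight of the first, so the aggregate is `[2.5, 3.5]`.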

Model conversion jobs

Model conversion now runs locally via the CLI rather than as server-side jobs. Use octomil convert to convert models on your machine before uploading:

# Convert locally
octomil convert model.pt --formats onnx,coreml,tflite --output converted_models

# Convert and push in one step
octomil convert model.pt --formats onnx,coreml,tflite --push --model-id my-model --version 1.0.0

The jobs dashboard still tracks conversion-related upload jobs. If a conversion fails locally (e.g., unsupported ONNX op for TFLite), fix the issue and re-run octomil convert for the failed format.

Filtering jobs

Filter the jobs list by:

  • Type — training, conversion, deployment, scheduled
  • Status — queued, running, completed, failed, cancelled
  • Time range — last hour, last 24h, last 7 days, custom
  • Model — filter by model name or ID
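
The four filters combine with AND semantics in the sketch below, which applies them to an in-memory list of rows. It is only an illustration of the filtering logic: `JobRow`, `filter_jobs`, and their parameters are hypothetical names, and it assumes the time range is matched against a job's creation time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class JobRow:
    # Hypothetical row shape for the jobs list.
    job_type: str  # training, conversion, deployment, scheduled
    status: str    # queued, running, completed, failed, cancelled
    model: str     # model name or ID
    created_at: datetime


def filter_jobs(
    jobs: Iterable[JobRow],
    job_type: Optional[str] = None,
    status: Optional[str] = None,
    model: Optional[str] = None,
    within: Optional[timedelta] = None,
    now: Optional[datetime] = None,
) -> List[JobRow]:
    """Keep rows matching every filter that is set; None means no filter."""
    cutoff = None
    if within is not None:
        # Assumption: the time range is relative to "now" and is compared
        # against the job's creation time.
        cutoff = (now or datetime.now()) - within
    return [
        j
        for j in jobs
        if (job_type is None or j.job_type == job_type)
        and (status is None or j.status == status)
        and (model is None or j.model == model)
        and (cutoff is None or j.created_at >= cutoff)
    ]
```

For example, `filter_jobs(rows, status="failed", within=timedelta(hours=24))` would correspond to selecting Status: failed and Time range: last 24h.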