Golden Path

The recommended path from zero to a model running on your production fleet. Follow these steps in order.


1. Pick a model

Browse the model catalog or bring your own. Octomil works with formats such as GGUF, ONNX, CoreML, TFLite, and SafeTensors.

# List available models
octomil models list

# Search for a specific architecture
octomil models search "phi"

Choose based on your target devices. Smaller models often fit phones and tablets more comfortably, while larger models typically need laptops or desktops.


2. Push and convert

octomil push phi-4-mini

Octomil imports the model, converts it to edge-optimized formats, and registers it in your org's model registry. Add --quantize to automatically select the best quantization for your target hardware.
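
For example, to let Octomil pick a quantization for your fleet:

octomil push phi-4-mini --quantize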

Verify the model works locally before deploying:

octomil serve phi-4-mini

Open http://localhost:8080 to chat with the model.
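
You can also smoke-test the local server from a script. The sketch below assumes an OpenAI-compatible /v1/chat/completions route on the local server; that route is an assumption, not a documented guarantee, so verify it for your version. The browser UI above remains the simplest check.

# Hypothetical smoke test against the local `octomil serve` instance.
# ASSUMPTION: the server exposes an OpenAI-compatible chat completions
# route; adjust the path if your version differs.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "phi-4-mini",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])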


3. Integrate the SDK

Add the SDK to your app and initialize the client:

import octomil

# Authenticate against your org. Keep the API key in an environment
# variable or secret store rather than hard-coding it.
client = octomil.Client(api_key="edg_...", org_id="your-org-id")

# predict() takes a model name and a list of chat-style messages.
response = client.predict("phi-4-mini", [{"role": "user", "content": "Hello"}])
print(response)
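
The call above is a single stateless turn. Here is a minimal multi-turn sketch, assuming only the predict(model, messages) signature shown above; the ask() helper is illustrative, not part of the SDK:

import octomil

client = octomil.Client(api_key="edg_...", org_id="your-org-id")
history = []

def ask(prompt: str) -> str:
    # Append the user turn, send the full history, and record the reply.
    history.append({"role": "user", "content": prompt})
    reply = str(client.predict("phi-4-mini", history))
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("Hello"))
print(ask("Summarize that in one sentence."))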

4. Run a quality eval

Before deploying to users, verify on-device quality matches expectations:

octomil eval phi-4-mini --dataset my-eval-set

Compare latency, quality, and resource usage against your baseline. See Benchmarks for evaluation datasets and device profiles.
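
If you export eval results for CI, a simple pass/fail gate can keep regressions out of your rollout. Everything in the sketch below is an assumption for illustration: the results.json layout, the metric names, and the thresholds all depend on your eval tooling and your baseline.

# Hypothetical quality gate over exported eval results.
# ASSUMPTION: results.json and its metric keys are illustrative only.
import json

with open("results.json") as f:
    results = json.load(f)

BASELINE = {"p95_latency_ms": 450, "accuracy": 0.82}

ok = (
    results["p95_latency_ms"] <= BASELINE["p95_latency_ms"] * 1.10  # tolerate 10% latency drift
    and results["accuracy"] >= BASELINE["accuracy"] - 0.01          # but almost no quality drop
)
print("PASS" if ok else "FAIL: regressed vs. baseline")
raise SystemExit(0 if ok else 1)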


5. Deploy with a canary rollout

Start with a small percentage of your fleet:

octomil deploy phi-4-mini --group production --rollout 10%

Monitor error rates and latency in the dashboard. If metrics look healthy, advance:

octomil rollout advance phi-4-mini

Repeat until you reach 100%. If something goes wrong at any step:

octomil rollback phi-4-mini --to-version 1.0.0
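
If you automate the loop, gate each advance on a health check. The sketch below drives the documented CLI commands via subprocess; fleet_is_healthy() is a placeholder you would implement against your own monitoring, and the stage count and soak time are illustrative.

# Hypothetical canary automation using the CLI commands above.
import subprocess
import time

def run(*args: str) -> None:
    subprocess.run(["octomil", *args], check=True)

def fleet_is_healthy() -> bool:
    # Placeholder: query your dashboards or alerting here.
    return True

run("deploy", "phi-4-mini", "--group", "production", "--rollout", "10%")

for _ in range(4):  # number of stages depends on your rollout policy
    time.sleep(30 * 60)  # soak time between stages; tune for your fleet
    if not fleet_is_healthy():
        run("rollback", "phi-4-mini", "--to-version", "1.0.0")
        raise SystemExit("rollout aborted: metrics regressed")
    run("rollout", "advance", "phi-4-mini")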

See Rollouts for strategies and Rollback for recovery options.


6. Monitor in production

octomil dashboard

Track inference latency, device health, error rates, and resource consumption. Set up alerts for regressions and use Telemetry for custom metrics and exports.


What's next

  • Responses -- understand the inference response format
  • Control -- constrain model output with schemas and parameters
  • Routing -- serve multiple models and route by query complexity
  • Tool calling -- let models call functions and APIs