Device Targeting

Octomil analyzes your inference telemetry to identify models that would benefit from on-device deployment. When a cloud-served model accumulates enough request volume and your device fleet has compatible hardware, Octomil generates a recommendation with a cost savings estimate and a confidence score.

How It Works

The recommendation engine runs periodically against your telemetry data and evaluates each model along three dimensions:

  1. Request volume -- models with high daily request counts have the most to gain from moving to device.
  2. Device diversity -- models requested from many distinct devices are good candidates because deployment reaches more users.
  3. Fleet size -- a larger fleet of compatible devices increases the potential cost savings.

Analysis Criteria

A model is evaluated for recommendation when it exceeds the minimum daily request threshold (default: 1,000 requests/day). The engine then:

  • Calculates current cloud inference cost using the configured cost-per-millisecond rate
  • Estimates on-device cost savings accounting for an adoption factor (not all devices will be online)
  • Checks device compatibility against the model's size and format requirements
  • Produces a confidence score and recommendation type

Recommendation Types

| Type | Confidence Range | Meaning |
| --- | --- | --- |
| deploy_to_device | > 0.8 | High confidence. The model is a strong candidate for on-device deployment. |
| canary_test | 0.5 -- 0.8 | Moderate confidence. Recommend starting with a canary rollout to validate. |
| monitor | < 0.5 | Low confidence. Continue collecting telemetry before deciding. |
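The thresholds above can be sketched as a small helper. This is illustrative only, not the engine's actual code; in particular, behavior at exactly 0.5 or 0.8 is an assumption, since the ranges as documented leave the endpoints unspecified.

```python
def recommendation_type(confidence: float) -> str:
    """Map a normalized confidence score to a recommendation type.

    Boundary handling (a score of exactly 0.5 or 0.8) is an assumption;
    the documented ranges do not specify the endpoints.
    """
    if confidence > 0.8:
        return "deploy_to_device"
    if confidence >= 0.5:
        return "canary_test"
    return "monitor"
```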

Confidence Scoring

The confidence score is a weighted composite of three signals:

| Signal | Weight | What It Measures |
| --- | --- | --- |
| Request volume | 40% | Daily request count relative to the minimum threshold. Higher volume increases confidence. |
| Device diversity | 30% | Number of unique devices requesting the model. More devices means broader deployment impact. |
| Fleet size | 30% | Total compatible devices in the fleet. Larger fleets amplify cost savings. |

The score is normalized to 0.0 -- 1.0. A score of 1.0 means the model has high volume, is requested from many devices, and the fleet is large enough to absorb the workload.
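As a sketch, the weighted composite can be written directly from the table. The individual signal values are assumed to already be normalized to 0.0 -- 1.0, as the per-signal scores in the API response's details object suggest.

```python
def confidence_score(volume: float, diversity: float, fleet: float) -> float:
    """Combine the three normalized signals with the 40/30/30 weighting."""
    return 0.4 * volume + 0.3 * diversity + 0.3 * fleet

# e.g. strong volume but a modest fleet:
# confidence_score(0.9, 0.6, 0.5) ≈ 0.69 -- the canary_test range
```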

Cost Model

The cost savings estimate uses a straightforward formula:

estimated_monthly_savings = daily_requests * avg_latency_ms * cost_per_ms * 30 * adoption_factor
| Parameter | Default | Description |
| --- | --- | --- |
| cost_per_ms | $0.000003 | Cloud inference cost per millisecond of compute time |
| adoption_factor | 0.7 | Fraction of devices expected to successfully adopt the on-device model |

These values are configurable. See Configuration below.
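In code, the estimate looks like this sketch. The defaults mirror the table above; the example request volume and latency are hypothetical.

```python
def estimated_monthly_savings(
    daily_requests: int,
    avg_latency_ms: float,
    cost_per_ms: float = 0.000003,  # documented default rate
    adoption_factor: float = 0.7,   # documented default adoption
) -> float:
    """Daily compute spend, extrapolated to 30 days and scaled by adoption."""
    return daily_requests * avg_latency_ms * cost_per_ms * 30 * adoption_factor

# e.g. 10,000 requests/day at 20 ms average latency:
# estimated_monthly_savings(10_000, 20) ≈ $12.60/month
```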

API Endpoints

GET /api/v1/recommendations

List all current recommendations.

curl -H "Authorization: Bearer <token>" \
  https://api.octomil.com/api/v1/recommendations

Response:

{
  "recommendations": [
    {
      "model_id": "text-classifier-v3",
      "recommendation_type": "deploy_to_device",
      "confidence": 0.87,
      "daily_requests": 45200,
      "avg_latency_ms": 12.3,
      "estimated_monthly_savings_usd": 142.80,
      "compatible_devices": 1823,
      "unique_requesting_devices": 412,
      "analysis_window_days": 7,
      "created_at": "2026-02-19T08:00:00Z"
    },
    {
      "model_id": "sentiment-v2",
      "recommendation_type": "canary_test",
      "confidence": 0.64,
      "daily_requests": 8300,
      "avg_latency_ms": 18.7,
      "estimated_monthly_savings_usd": 28.05,
      "compatible_devices": 945,
      "unique_requesting_devices": 87,
      "analysis_window_days": 7,
      "created_at": "2026-02-19T08:00:00Z"
    }
  ]
}
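Client-side, the list can be filtered and ranked however you like. A sketch that pulls out deploy-ready models, highest estimated savings first, using a trimmed copy of the sample payload above:

```python
import json

# Trimmed copy of the sample response above.
payload = json.loads("""
{"recommendations": [
  {"model_id": "text-classifier-v3", "recommendation_type": "deploy_to_device",
   "confidence": 0.87, "estimated_monthly_savings_usd": 142.80},
  {"model_id": "sentiment-v2", "recommendation_type": "canary_test",
   "confidence": 0.64, "estimated_monthly_savings_usd": 28.05}
]}
""")

# Keep only deploy_to_device recommendations, sorted by savings.
deploy_ready = sorted(
    (r for r in payload["recommendations"]
     if r["recommendation_type"] == "deploy_to_device"),
    key=lambda r: r["estimated_monthly_savings_usd"],
    reverse=True,
)
print([r["model_id"] for r in deploy_ready])  # ['text-classifier-v3']
```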

GET /api/v1/recommendations/{model_id}

Get the recommendation for a specific model.

curl -H "Authorization: Bearer <token>" \
  https://api.octomil.com/api/v1/recommendations/text-classifier-v3

Response:

{
  "model_id": "text-classifier-v3",
  "recommendation_type": "deploy_to_device",
  "confidence": 0.87,
  "daily_requests": 45200,
  "avg_latency_ms": 12.3,
  "estimated_monthly_savings_usd": 142.80,
  "compatible_devices": 1823,
  "unique_requesting_devices": 412,
  "analysis_window_days": 7,
  "created_at": "2026-02-19T08:00:00Z",
  "details": {
    "volume_score": 0.95,
    "diversity_score": 0.82,
    "fleet_score": 0.78,
    "cost_breakdown": {
      "current_monthly_cloud_cost_usd": 203.99,
      "projected_monthly_device_cost_usd": 61.19,
      "savings_usd": 142.80,
      "adoption_factor": 0.7
    }
  }
}
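The fields in cost_breakdown are related by simple subtraction: savings_usd is the current cloud cost minus the projected on-device cost. A quick arithmetic check against the numbers above:

```python
current = 203.99   # current_monthly_cloud_cost_usd
projected = 61.19  # projected_monthly_device_cost_usd

savings = round(current - projected, 2)
print(savings)  # 142.8 -- matches savings_usd in the response
```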

POST /api/v1/recommendations/{model_id}/deploy

Trigger a canary deployment for a recommended model. This creates a rollout that starts sending a percentage of traffic to on-device inference.

curl -X POST \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  https://api.octomil.com/api/v1/recommendations/text-classifier-v3/deploy

Response:

{
  "rollout_id": "rol_a1b2c3d4",
  "model_id": "text-classifier-v3",
  "status": "canary",
  "canary_percentage": 10,
  "created_at": "2026-02-19T14:30:00Z"
}

The canary starts at 10% of eligible devices. Monitor the rollout in the Rollouts dashboard and promote or roll back as needed.

Dashboard Widget

The Monitoring page includes a Recommendations panel that shows:

  • Models with active recommendations, sorted by estimated savings
  • Recommendation type badge (deploy_to_device, canary_test, monitor)
  • Estimated monthly savings in USD
  • A Deploy button for models with deploy_to_device or canary_test recommendations

The Deploy button triggers the /deploy endpoint and creates a canary rollout directly from the dashboard.

Configuration

Configure the recommendation engine using environment variables.

| Variable | Default | Description |
| --- | --- | --- |
| OCTOMIL_CLOUD_COST_PER_MS | 0.000003 | Cost in USD per millisecond of cloud inference compute |
| OCTOMIL_RECOMMEND_MIN_DAILY_REQUESTS | 1000 | Minimum daily requests before a model is evaluated |
| OCTOMIL_RECOMMEND_ADOPTION_FACTOR | 0.7 | Expected fraction of compatible devices that will run the on-device model |
| OCTOMIL_RECOMMEND_LOOKBACK_DAYS | 7 | Number of days of telemetry to analyze |

Example:

export OCTOMIL_CLOUD_COST_PER_MS=0.000005
export OCTOMIL_RECOMMEND_MIN_DAILY_REQUESTS=500
export OCTOMIL_RECOMMEND_ADOPTION_FACTOR=0.6
export OCTOMIL_RECOMMEND_LOOKBACK_DAYS=14
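The engine reads these variables internally; as an illustration only, a tool consuming the same settings might parse them like this (the names and defaults come from the table above, but the parsing code itself is hypothetical):

```python
import os

# Fall back to the documented defaults when a variable is unset.
cost_per_ms = float(os.environ.get("OCTOMIL_CLOUD_COST_PER_MS", "0.000003"))
min_daily_requests = int(os.environ.get("OCTOMIL_RECOMMEND_MIN_DAILY_REQUESTS", "1000"))
adoption_factor = float(os.environ.get("OCTOMIL_RECOMMEND_ADOPTION_FACTOR", "0.7"))
lookback_days = int(os.environ.get("OCTOMIL_RECOMMEND_LOOKBACK_DAYS", "7"))
```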

Workflow

A typical workflow using recommendations:

  1. Enable telemetry on your inference server with octomil serve --api-key <your-api-key>.
  2. Wait for data -- the engine needs at least one full lookback window of telemetry (default: 7 days).
  3. Review recommendations in the Monitoring dashboard or via the API.
  4. Deploy a canary for high-confidence recommendations using the Deploy button or the API.
  5. Monitor the canary in the Rollouts dashboard. Check error rates and latency on devices.
  6. Promote or roll back based on canary results.