Device Targeting
Octomil analyzes your inference telemetry to identify models that would benefit from on-device deployment. When a cloud-served model accumulates enough request volume and your device fleet has compatible hardware, Octomil generates a recommendation with a cost savings estimate and a confidence score.
How It Works
The recommendation engine runs periodically against your telemetry data and evaluates each model along three dimensions:
- Request volume -- models with high daily request counts have the most to gain from moving to device.
- Device diversity -- models requested from many distinct devices are good candidates because deployment reaches more users.
- Fleet size -- a larger fleet of compatible devices increases the potential cost savings.
Analysis Criteria
A model is evaluated for recommendation when it exceeds the minimum daily request threshold (default: 1,000 requests/day). The engine then:
- Calculates current cloud inference cost using the configured cost-per-millisecond rate
- Estimates on-device cost savings accounting for an adoption factor (not all devices will be online)
- Checks device compatibility against the model's size and format requirements
- Produces a confidence score and recommendation type
Recommendation Types
| Type | Confidence Range | Meaning |
|---|---|---|
| `deploy_to_device` | > 0.8 | High confidence. The model is a strong candidate for on-device deployment. |
| `canary_test` | 0.5 -- 0.8 | Moderate confidence. Recommend starting with a canary rollout to validate. |
| `monitor` | < 0.5 | Low confidence. Continue collecting telemetry before deciding. |
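The thresholds in the table above map directly to a recommendation type. As a sketch (the `recommendation_type` helper is illustrative, not part of any Octomil SDK):

```python
def recommendation_type(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to a recommendation type."""
    if confidence > 0.8:
        return "deploy_to_device"
    if confidence >= 0.5:
        return "canary_test"
    return "monitor"
```

Note that 0.8 itself falls in the `canary_test` band; only scores strictly above 0.8 produce `deploy_to_device`.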
Confidence Scoring
The confidence score is a weighted composite of three signals:
| Signal | Weight | What It Measures |
|---|---|---|
| Request volume | 40% | Daily request count relative to the minimum threshold. Higher volume increases confidence. |
| Device diversity | 30% | Number of unique devices requesting the model. More devices means broader deployment impact. |
| Fleet size | 30% | Total compatible devices in the fleet. Larger fleets amplify cost savings. |
The score is normalized to 0.0 -- 1.0. A score of 1.0 means the model has high volume, is requested from many devices, and the fleet is large enough to absorb the workload.
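The weighting can be sketched as follows. This assumes each signal has already been normalized to 0.0 -- 1.0 (how Octomil normalizes the raw counts is not specified here), and the function name is illustrative:

```python
# Weights from the signal table: volume 40%, diversity 30%, fleet 30%
WEIGHTS = {"volume": 0.4, "diversity": 0.3, "fleet": 0.3}

def confidence_score(volume: float, diversity: float, fleet: float) -> float:
    """Weighted composite of three pre-normalized (0.0-1.0) signals."""
    score = (WEIGHTS["volume"] * volume
             + WEIGHTS["diversity"] * diversity
             + WEIGHTS["fleet"] * fleet)
    return round(score, 2)
```

For example, the per-signal sub-scores reported in the detailed recommendation response (`volume_score` 0.95, `diversity_score` 0.82, `fleet_score` 0.78) combine under these weights to roughly 0.86.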
Cost Model
The cost savings estimate uses a straightforward formula:
```text
estimated_monthly_savings = daily_requests * avg_latency_ms * cost_per_ms * 30 * adoption_factor
```
| Parameter | Default | Description |
|---|---|---|
| `cost_per_ms` | $0.000003 | Cloud inference cost per millisecond of compute time |
| `adoption_factor` | 0.7 | Fraction of devices expected to successfully adopt the on-device model |
These values are configurable. See Configuration below.
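A direct translation of the formula, with defaults from the parameter table (the function name is illustrative):

```python
def estimated_monthly_savings(
    daily_requests: int,
    avg_latency_ms: float,
    cost_per_ms: float = 0.000003,
    adoption_factor: float = 0.7,
) -> float:
    """Projected monthly USD savings from moving a model on-device."""
    monthly = daily_requests * avg_latency_ms * cost_per_ms * 30 * adoption_factor
    return round(monthly, 2)
```

For example, a model serving 10,000 requests/day at 20 ms average latency projects to about $12.60/month in savings at the default rates.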
API Endpoints
`GET /api/v1/recommendations`
List all current recommendations.
**cURL**

```bash
curl -H "Authorization: Bearer <token>" \
  https://api.octomil.com/api/v1/recommendations
```

**Python**

```python
import requests

response = requests.get(
    "https://api.octomil.com/api/v1/recommendations",
    headers={"Authorization": "Bearer <token>"},
)
print(response.json())
```

**JavaScript**

```javascript
const response = await fetch(
  "https://api.octomil.com/api/v1/recommendations",
  {
    headers: { "Authorization": "Bearer <token>" },
  }
);
const data = await response.json();
console.log(data);
```
Response:
```json
{
  "recommendations": [
    {
      "model_id": "text-classifier-v3",
      "recommendation_type": "deploy_to_device",
      "confidence": 0.87,
      "daily_requests": 45200,
      "avg_latency_ms": 12.3,
      "estimated_monthly_savings_usd": 142.80,
      "compatible_devices": 1823,
      "unique_requesting_devices": 412,
      "analysis_window_days": 7,
      "created_at": "2026-02-19T08:00:00Z"
    },
    {
      "model_id": "sentiment-v2",
      "recommendation_type": "canary_test",
      "confidence": 0.64,
      "daily_requests": 8300,
      "avg_latency_ms": 18.7,
      "estimated_monthly_savings_usd": 28.05,
      "compatible_devices": 945,
      "unique_requesting_devices": 87,
      "analysis_window_days": 7,
      "created_at": "2026-02-19T08:00:00Z"
    }
  ]
}
```
`GET /api/v1/recommendations/{model_id}`
Get the recommendation for a specific model.
**cURL**

```bash
curl -H "Authorization: Bearer <token>" \
  https://api.octomil.com/api/v1/recommendations/text-classifier-v3
```

**Python**

```python
import requests

response = requests.get(
    "https://api.octomil.com/api/v1/recommendations/text-classifier-v3",
    headers={"Authorization": "Bearer <token>"},
)
print(response.json())
```

**JavaScript**

```javascript
const response = await fetch(
  "https://api.octomil.com/api/v1/recommendations/text-classifier-v3",
  {
    headers: { "Authorization": "Bearer <token>" },
  }
);
const data = await response.json();
console.log(data);
```
Response:
```json
{
  "model_id": "text-classifier-v3",
  "recommendation_type": "deploy_to_device",
  "confidence": 0.87,
  "daily_requests": 45200,
  "avg_latency_ms": 12.3,
  "estimated_monthly_savings_usd": 142.80,
  "compatible_devices": 1823,
  "unique_requesting_devices": 412,
  "analysis_window_days": 7,
  "created_at": "2026-02-19T08:00:00Z",
  "details": {
    "volume_score": 0.95,
    "diversity_score": 0.82,
    "fleet_score": 0.78,
    "cost_breakdown": {
      "current_monthly_cloud_cost_usd": 203.99,
      "projected_monthly_device_cost_usd": 61.19,
      "savings_usd": 142.80,
      "adoption_factor": 0.7
    }
  }
}
```
`POST /api/v1/recommendations/{model_id}/deploy`
Trigger a canary deployment for a recommended model. This creates a rollout that starts sending a percentage of traffic to on-device inference.
**cURL**

```bash
curl -X POST \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  https://api.octomil.com/api/v1/recommendations/text-classifier-v3/deploy
```

**Python**

```python
import requests

response = requests.post(
    "https://api.octomil.com/api/v1/recommendations/text-classifier-v3/deploy",
    headers={"Authorization": "Bearer <token>"},
)
print(response.json())
```

**JavaScript**

```javascript
const response = await fetch(
  "https://api.octomil.com/api/v1/recommendations/text-classifier-v3/deploy",
  {
    method: "POST",
    headers: {
      "Authorization": "Bearer <token>",
      "Content-Type": "application/json",
    },
  }
);
const data = await response.json();
console.log(data);
```
Response:
```json
{
  "rollout_id": "rol_a1b2c3d4",
  "model_id": "text-classifier-v3",
  "status": "canary",
  "canary_percentage": 10,
  "created_at": "2026-02-19T14:30:00Z"
}
```
The canary starts at 10% of eligible devices. Monitor the rollout in the Rollouts dashboard and promote or roll back as needed.
Dashboard Widget
The Monitoring page includes a Recommendations panel that shows:
- Models with active recommendations, sorted by estimated savings
- Recommendation type badge (`deploy_to_device`, `canary_test`, `monitor`)
- Estimated monthly savings in USD
- A Deploy button for models with `deploy_to_device` or `canary_test` recommendations

The Deploy button triggers the `/deploy` endpoint and creates a canary rollout directly from the dashboard.
Configuration
Configure the recommendation engine using environment variables.
| Variable | Default | Description |
|---|---|---|
| `OCTOMIL_CLOUD_COST_PER_MS` | 0.000003 | Cost in USD per millisecond of cloud inference compute |
| `OCTOMIL_RECOMMEND_MIN_DAILY_REQUESTS` | 1000 | Minimum daily requests before a model is evaluated |
| `OCTOMIL_RECOMMEND_ADOPTION_FACTOR` | 0.7 | Expected fraction of compatible devices that will run the on-device model |
| `OCTOMIL_RECOMMEND_LOOKBACK_DAYS` | 7 | Number of days of telemetry to analyze |
Example:
```bash
export OCTOMIL_CLOUD_COST_PER_MS=0.000005
export OCTOMIL_RECOMMEND_MIN_DAILY_REQUESTS=500
export OCTOMIL_RECOMMEND_ADOPTION_FACTOR=0.6
export OCTOMIL_RECOMMEND_LOOKBACK_DAYS=14
```
Workflow
A typical workflow using recommendations:
- Enable telemetry on your inference server with `octomil serve --api-key <your-api-key>`.
- Wait for data -- the engine needs at least one lookback window of telemetry data.
- Review recommendations in the Monitoring dashboard or via the API.
- Deploy a canary for high-confidence recommendations using the Deploy button or the API.
- Monitor the canary in the Rollouts dashboard. Check error rates and latency on devices.
- Promote or roll back based on canary results.
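The review-and-deploy steps can also be scripted against the endpoints on this page. A minimal sketch: the endpoint paths and response fields come from the examples above, while the helper names and the 0.8 cutoff are illustrative choices:

```python
import requests

API = "https://api.octomil.com/api/v1"
HEADERS = {"Authorization": "Bearer <token>"}

def select_for_deploy(recommendations: list[dict], min_confidence: float = 0.8) -> list[str]:
    """Pick model IDs whose recommendation clears the confidence bar."""
    return [
        rec["model_id"]
        for rec in recommendations
        if rec["recommendation_type"] == "deploy_to_device"
        and rec["confidence"] >= min_confidence
    ]

def run() -> None:
    """Fetch current recommendations, then trigger a canary for each selected model."""
    recs = requests.get(f"{API}/recommendations", headers=HEADERS).json()
    for model_id in select_for_deploy(recs["recommendations"]):
        requests.post(f"{API}/recommendations/{model_id}/deploy", headers=HEADERS)
```

Each POST returns a `rollout_id`; monitoring, promotion, and rollback still happen in the Rollouts dashboard or API as described above.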
Related Docs
- Octomil Serve -- local inference server setup
- Telemetry and Observability -- how telemetry data is collected
- Model Rollouts -- canary and gradual deployment
- Monitoring Dashboard -- view recommendations in the dashboard