Model Routing
Octomil can serve multiple models simultaneously and route each request to the most efficient model that can handle it. Simple queries go to a smaller model. Harder queries escalate to larger or higher-quality models. The result is lower average latency and lower compute cost without sacrificing quality on harder prompts.
The Problem
Running a large model for "What time is it in Tokyo?" wastes resources. But using a small model for everything can hurt quality on harder tasks. Manually tagging prompts by complexity is brittle and does not scale.
Octomil analyzes each query and routes it to an appropriate tier. You can keep the defaults or tune the routing policy for your workload.
Quick Start
octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route
Startup output shows the tier assignments:
[routing] Loading 3 models:
Tier 1 (fast): smollm-360m (360M params, ~2 tok/ms)
Tier 2 (balanced): phi-4-mini (3.8B params, ~0.8 tok/ms)
Tier 3 (quality): llama-3.2-3b (3B params, ~0.6 tok/ms)
[routing] Auto-routing enabled. Queries routed by complexity.
[routing] Tier-0 (deterministic): arithmetic, unit conversions
[serve] Listening on http://localhost:8080
Send requests to the same endpoint as usual. Routing stays transparent to the client:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "auto",
messages: [{ role: "user", content: "What is 2+2?" }],
});
console.log(response.choices[0].message.content);
Routing Decisions
The response includes headers showing which model handled the request:
X-Octomil-Routed-Model: smollm-360m
X-Octomil-Routing-Tier: 1
X-Octomil-Routing-Latency-Us: 42
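If you want to act on these headers programmatically, a small helper can parse them. The sample dict below mirrors the headers shown above; with the OpenAI Python SDK you can obtain raw response headers via `client.chat.completions.with_raw_response.create(...)`.

```python
# Parse Octomil routing headers into (model, tier, latency_us).
# The sample values mirror the response headers shown above.
def routing_info(headers: dict) -> tuple:
    return (
        headers["X-Octomil-Routed-Model"],
        int(headers["X-Octomil-Routing-Tier"]),
        int(headers["X-Octomil-Routing-Latency-Us"]),
    )

model, tier, latency_us = routing_info({
    "X-Octomil-Routed-Model": "smollm-360m",
    "X-Octomil-Routing-Tier": "1",
    "X-Octomil-Routing-Latency-Us": "42",
})
print(model, tier, latency_us)  # → smollm-360m 1 42
```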
For the same server, a harder query routes to a larger model:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices."}]
}'
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices."}],
)
print(response.choices[0].message.content)
const response = await client.chat.completions.create({
model: "auto",
messages: [{ role: "user", content: "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices." }],
});
console.log(response.choices[0].message.content);
X-Octomil-Routed-Model: llama-3.2-3b
X-Octomil-Routing-Tier: 3
X-Octomil-Routing-Latency-Us: 38
Requesting a Specific Model
You can bypass routing by specifying a model name directly:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Hello"}]
}'
response = client.chat.completions.create(
model="phi-4-mini",
messages=[{"role": "user", "content": "Hello"}],
)
const response = await client.chat.completions.create({
model: "phi-4-mini",
messages: [{ role: "user", content: "Hello" }],
});
When `model` is set to `"auto"`, routing applies. When it is set to a specific model name, the request goes directly to that model.
Tier-0: Instant Answers
Simple arithmetic and unit conversions can be answered deterministically without invoking a model. This is the fastest tier and avoids model usage for queries that do not need it.
> "What is 15% of 240?"
< "36.0" (0.2ms, no model used)
> "Convert 72°F to Celsius"
< "22.22°C" (0.1ms, no model used)
Tier-0 uses safe AST-based evaluation rather than `eval()`. It handles basic math, percentages, unit conversions, and simple expressions. Anything it cannot solve deterministically is passed to Tier 1 or higher.
Tier-0 is enabled automatically when `--auto-route` is set. No additional configuration is needed.
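The AST-based approach can be illustrated with a minimal sketch: walk the parsed expression tree and apply only a whitelist of operators, never calling `eval()`. This is an illustration of the technique, not Octomil's actual implementation (which also handles percentages and unit conversions).

```python
# Minimal sketch of Tier-0-style safe arithmetic: walk the AST and
# apply whitelisted operators only. Illustrative, not Octomil's code.
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("0.15 * 240"))  # → 36.0
```

Anything the walker does not recognize raises, which is the cue to fall through to Tier 1.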
Query Decomposition
Multi-part queries are automatically detected and split into subtasks. Each subtask is routed to the appropriate tier independently, and results are merged into a single response.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French."}]
}'
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French."}],
)
print(response.choices[0].message.content)
const response = await client.chat.completions.create({
model: "auto",
messages: [{ role: "user", content: "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French." }],
});
console.log(response.choices[0].message.content);
Response headers show how the query was decomposed:
X-Octomil-Decomposed: true
X-Octomil-Subtasks: 3
X-Octomil-Subtask-Models: llama-3.2-3b,tier0,smollm-360m
In this example:
- "Summarize attention mechanisms" → Tier 3 (llama-3.2-3b, complex reasoning)
- "Calculate 15% of 340" → Tier 0 (instant arithmetic, no model)
- "Translate hello world to French" → Tier 1 (smollm-360m, simple task)
The client sees a single merged response. Decomposition is transparent.
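To make the decomposition step concrete, here is a deliberately naive sketch that splits a query on sequencing connectives. Octomil's actual decomposer is more sophisticated; this only illustrates the idea of turning one prompt into independently routable subtasks.

```python
# Naive sketch of multi-part query detection: split on sequencing
# connectives like ", then" and ", and". Illustrative only.
import re

def decompose(query: str) -> list:
    parts = re.split(r",\s*then\s+|,\s*and\s+", query)
    return [p.strip().rstrip(".") for p in parts if p.strip()]

subtasks = decompose(
    "Summarize the key points of attention mechanisms, "
    "then calculate 15% of 340, and translate hello world to French."
)
print(len(subtasks))  # → 3
```

Each element of `subtasks` would then be scored and routed on its own, matching the three-subtask breakdown shown above.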
Complexity Scoring
The router evaluates query complexity using multiple signals:
| Signal | Low Complexity | High Complexity |
|---|---|---|
| Word count | Short, direct questions | Long, multi-part prompts |
| Technical vocabulary | Common words | Domain-specific jargon |
| Code indicators | No code references | Code snippets, debugging |
| Multi-turn depth | First message | Deep conversation history |
| System prompt | None or simple | Complex instructions |
These signals produce a complexity score between 0.0 and 1.0. Configurable thresholds map scores to tiers:
octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --route-strategy complexity
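The score-to-tier mapping can be sketched as a simple threshold lookup. The cutoff values below are hypothetical, chosen for illustration; they are not Octomil's defaults.

```python
# Map a 0.0-1.0 complexity score to a tier via configurable cutoffs.
# The threshold values here are made up for illustration.
def score_to_tier(score: float, thresholds: tuple = (0.25, 0.6)) -> int:
    for tier, cutoff in enumerate(thresholds, start=1):
        if score < cutoff:
            return tier
    return len(thresholds) + 1  # top tier

print(score_to_tier(0.1), score_to_tier(0.4), score_to_tier(0.9))  # → 1 2 3
```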
Routing Statistics
Monitor routing distribution via the stats endpoint:
curl http://localhost:8080/v1/routing/stats
{
"total_requests": 15234,
"routing_distribution": {
"tier0_deterministic": {"count": 2104, "percentage": 13.8},
"smollm-360m": {"count": 7738, "percentage": 50.8},
"phi-4-mini": {"count": 3811, "percentage": 25.0},
"llama-3.2-3b": {"count": 1581, "percentage": 10.4}
},
"avg_routing_latency_us": 35,
"estimated_savings": {
"vs_largest_model_only": "64% fewer compute-seconds",
"avg_latency_reduction_ms": 82
}
}
In typical workloads, 60-80% of queries are handled by the smallest model or Tier 0. The routing overhead is negligible (under 50 microseconds per request).
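As a quick check against that 60-80% figure, you can compute the cheap-path fraction directly from the stats payload. The `stats` dict below mirrors the sample response above; in practice you would fetch it with `requests.get("http://localhost:8080/v1/routing/stats").json()`.

```python
# Summarize a /v1/routing/stats payload: what fraction of requests
# never touched a model larger than the smallest tier?
stats = {
    "total_requests": 15234,
    "routing_distribution": {
        "tier0_deterministic": {"count": 2104, "percentage": 13.8},
        "smollm-360m": {"count": 7738, "percentage": 50.8},
        "phi-4-mini": {"count": 3811, "percentage": 25.0},
        "llama-3.2-3b": {"count": 1581, "percentage": 10.4},
    },
}

def cheap_fraction(stats: dict) -> float:
    """Fraction of requests handled by Tier 0 or the smallest model."""
    dist = stats["routing_distribution"]
    cheap = dist["tier0_deterministic"]["count"] + dist["smollm-360m"]["count"]
    return cheap / stats["total_requests"]

print(f"{cheap_fraction(stats):.1%}")  # → 64.6%
```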
Fallback Chain
If the routed model produces a low-confidence response, Octomil can escalate to the next tier automatically. Enable this with the fallback flag:
octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --fallback
The escalation chain works bottom-up: Tier 1 to Tier 2 to Tier 3. If Tier 1 handles the query successfully, no escalation occurs. The client sees a single response; the escalation is invisible.
Escalation adds latency (the failed attempt plus the retry), so it is disabled by default. Enable it when quality matters more than latency predictability.
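The escalation loop can be sketched as follows. How Octomil measures confidence and what threshold it uses are not documented here, so both are assumptions; the tier functions are hypothetical stand-ins for model calls.

```python
# Sketch of the bottom-up fallback chain. The confidence threshold and
# the (answer, confidence) tier interface are assumptions for illustration.
def answer_with_fallback(query, tiers, min_confidence=0.7):
    """Try each tier in order; escalate while confidence is low."""
    for tier in tiers:
        text, confidence = tier(query)
        if confidence >= min_confidence:
            return text
    return text  # last tier's answer, even if still low-confidence

# Hypothetical tier functions returning (answer, confidence):
tier1 = lambda q: ("small-model answer", 0.4)
tier2 = lambda q: ("mid-model answer", 0.9)
print(answer_with_fallback("hard question", [tier1, tier2]))  # → mid-model answer
```

The latency cost noted above is visible in the structure: a low-confidence Tier 1 attempt runs to completion before the Tier 2 retry starts.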
Server-Side API: Routing Engine
The routing engine is also available as a server-side API for applications that need to make routing decisions programmatically. This is useful when your SDKs or devices need to decide where to run inference.
POST /api/v1/route
Request a routing decision for a model and device combination:
- cURL
- Python
- JavaScript
curl -X POST https://api.octomil.com/api/v1/route \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"model_id": "text-classifier",
"model_params": 500000000,
"model_size_mb": 250,
"device_capabilities": {
"platform": "ios",
"total_memory_mb": 8192,
"gpu_available": true,
"npu_available": true
},
"prefer": "device"
}'
import requests
response = requests.post(
"https://api.octomil.com/api/v1/route",
headers={"Authorization": "Bearer <token>"},
json={
"model_id": "text-classifier",
"model_params": 500000000,
"model_size_mb": 250,
"device_capabilities": {
"platform": "ios",
"total_memory_mb": 8192,
"gpu_available": True,
"npu_available": True,
},
"prefer": "device",
},
)
print(response.json())
const response = await fetch("https://api.octomil.com/api/v1/route", {
method: "POST",
headers: {
"Authorization": "Bearer <token>",
"Content-Type": "application/json",
},
body: JSON.stringify({
model_id: "text-classifier",
model_params: 500000000,
model_size_mb: 250,
device_capabilities: {
platform: "ios",
total_memory_mb: 8192,
gpu_available: true,
npu_available: true,
},
prefer: "device",
}),
});
const data = await response.json();
console.log(data);
Response:
{
"id": "rd_a1b2c3d4",
"target": "device",
"format": "coreml",
"engine": "ane",
"quantization": "int4",
"reason": "Device is capable, routing to on-device execution. Device tier=flagship, model=500,000,000 params / 250MB",
"fallback_target": {
"target": "cloud",
"endpoint": "/api/v1/inference",
"format": "onnx",
"engine": "onnx_runtime"
},
"device_class": "flagship",
"estimated_device_latency_ms": 75.0,
"estimated_cloud_latency_ms": 50.0,
"routing_latency_us": 28
}
The `prefer` field accepts `"device"`, `"cloud"`, `"cheapest"`, or `"fastest"`. Every response includes a cloud fallback configuration so your SDK can recover if the device can't handle the model at runtime.
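Client-side, acting on a routing decision looks roughly like this sketch: try the recommended target, and fall back to the included cloud configuration on failure. `run_on_device` and `run_in_cloud` are hypothetical callables standing in for your SDK's execution paths.

```python
# Sketch of SDK-side handling of a /api/v1/route decision.
# run_on_device / run_in_cloud are hypothetical execution callables.
def execute(decision: dict, run_on_device, run_in_cloud):
    """Run inference per a routing decision, with cloud fallback."""
    fb = decision["fallback_target"]
    if decision["target"] == "device":
        try:
            return run_on_device(decision["format"], decision["engine"])
        except RuntimeError:
            pass  # device couldn't handle it at runtime; fall through
    return run_in_cloud(fb["endpoint"], fb["format"])

# Demo with the sample decision above and a device path that fails:
decision = {
    "target": "device", "format": "coreml", "engine": "ane",
    "fallback_target": {"target": "cloud", "endpoint": "/api/v1/inference",
                        "format": "onnx", "engine": "onnx_runtime"},
}

def broken_device(fmt, engine):
    raise RuntimeError("model too large for available RAM")

result = execute(decision, broken_device, lambda ep, fmt: f"cloud:{ep}:{fmt}")
print(result)  # → cloud:/api/v1/inference:onnx
```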
GET /api/v1/route/stats/{model_id}
View aggregated routing statistics for a specific model across your fleet:
curl -H "Authorization: Bearer <token>" \
https://api.octomil.com/api/v1/route/stats/text-classifier
{
"model_id": "text-classifier",
"total_decisions": 8421,
"device_count": 7103,
"cloud_count": 1318,
"device_pct": 84.3,
"cloud_pct": 15.7,
"common_reasons": [
{"reason": "Device is capable, routing to on-device execution", "count": 7103},
{"reason": "Device RAM insufficient for safe inference", "count": 1318}
],
"avg_routing_latency_us": 31
}
Dashboard
Routing distribution is visible in the Monitoring Dashboard. The telemetry section shows:
- Requests per model tier over time
- Routing decision breakdown (device vs cloud, per-model)
- Escalation frequency (if fallback is enabled)
- Average routing latency
Gotchas
- Routing adds latency: routing decisions take ~30μs per request. This is negligible for most workloads, but if you're already using a single model, `--auto-route` adds overhead with no benefit.
- Complexity scoring is heuristic: the router uses token count, keyword detection, and query structure. It's not perfect. Override it with the `X-Model` header when you know better.
- All models must fit in memory: `--models` loads all specified models at startup, and each model consumes VRAM/RAM. Check your available memory before loading 3+ models.
- Tier-0 cache is per-server: if you run multiple serve instances, each has its own Tier-0 cache. There is no shared cache between instances.
Related
- Local Inference — server setup
- Early Exit — skip unnecessary transformer layers
- Device Profiling — benchmark data powering routing decisions
- Observability — monitor routing metrics
- Device Targeting — deployment recommendations