Model Routing

Octomil can serve multiple models simultaneously and route each request to the most efficient model that can handle it. Simple queries go to a smaller model. Harder queries escalate to larger or higher-quality models. The result is lower average latency and lower compute cost without sacrificing quality on harder prompts.

The Problem

Running a large model for "What time is it in Tokyo?" wastes resources. But using a small model for everything can hurt quality on harder tasks. Manually tagging prompts by complexity is brittle and does not scale.

Octomil analyzes each query and routes it to an appropriate tier. You can keep the defaults or tune the routing policy for your workload.

Quick Start

octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route

Startup output shows the tier assignments:

[routing] Loading 3 models:
Tier 1 (fast): smollm-360m (360M params, ~2 tok/ms)
Tier 2 (balanced): phi-4-mini (3.8B params, ~0.8 tok/ms)
Tier 3 (quality): llama-3.2-3b (3B params, ~0.6 tok/ms)
[routing] Auto-routing enabled. Queries routed by complexity.
[routing] Tier-0 (deterministic): arithmetic, unit conversions
[serve] Listening on http://localhost:8080

Send requests to the same endpoint as usual. Routing stays transparent to the client:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
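
If you prefer a client library over curl, any OpenAI-compatible client should work, since the endpoint follows the standard /v1/chat/completions shape. A minimal sketch in Python using the openai package (the api_key value is a placeholder; this assumes the local server does not enforce one):

from openai import OpenAI

# Point an OpenAI-compatible client at the local Octomil server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="auto",  # "auto" enables routing; a concrete model name bypasses it
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)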

Routing Decisions

The response includes headers showing which model handled the request:

X-Octomil-Routed-Model: smollm-360m
X-Octomil-Routing-Tier: 1
X-Octomil-Routing-Latency-Us: 42
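
To inspect these headers programmatically, read them off the raw HTTP response. A small sketch using the Python requests library (header names as shown above):

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
    },
)
# Values depend on how this particular query was routed.
print(resp.headers.get("X-Octomil-Routed-Model"))
print(resp.headers.get("X-Octomil-Routing-Tier"))
print(resp.headers.get("X-Octomil-Routing-Latency-Us"))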

On the same server, a harder query routes to a larger model:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices."}]
}'
X-Octomil-Routed-Model: llama-3.2-3b
X-Octomil-Routing-Tier: 3
X-Octomil-Routing-Latency-Us: 38

Requesting a Specific Model

You can bypass routing by specifying a model name directly:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Hello"}]
}'

When the model field is set to "auto", routing applies. When it is set to a specific model name, the request goes directly to that model.

Tier-0: Instant Answers

Simple arithmetic and unit conversions can be answered deterministically without invoking a model. This is the fastest tier and avoids model usage for queries that do not need it.

> "What is 15% of 240?"
< "36.0" (0.2ms, no model used)

> "Convert 72°F to Celsius"
< "22.22°C" (0.1ms, no model used)

Tier-0 uses safe AST-based evaluation rather than eval. It handles basic math, percentages, unit conversions, and simple expressions. Anything it cannot solve deterministically is passed to Tier 1 or higher.

Tier-0 is enabled automatically when --auto-route is set. No additional configuration needed.
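
For intuition, here is a minimal sketch of the general AST-based technique (not Octomil's actual implementation): parse the expression with Python's ast module and evaluate only a whitelist of arithmetic nodes, so arbitrary code can never run.

import ast
import operator

# Whitelisted operators; anything outside this map is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")  # would escalate to Tier 1+
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("0.15 * 240"))  # 36.0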

Query Decomposition

Multi-part queries are automatically detected and split into subtasks. Each subtask is routed to the appropriate tier independently, and results are merged into a single response.

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French."}]
}'

Response headers show how the query was decomposed:

X-Octomil-Decomposed: true
X-Octomil-Subtasks: 3
X-Octomil-Subtask-Models: llama-3.2-3b,tier0,smollm-360m

In this example:

  • "Summarize attention mechanisms" → Tier 3 (llama-3.2-3b, complex reasoning)
  • "Calculate 15% of 340" → Tier 0 (instant arithmetic, no model)
  • "Translate hello world to French" → Tier 1 (smollm-360m, simple task)

The client sees a single merged response. Decomposition is transparent.
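
The detection heuristics are internal to Octomil, but as a rough illustration, a decomposer might split a prompt on clause boundaries and route each piece on its own. A purely hypothetical sketch:

import re

def decompose(prompt: str) -> list[str]:
    """Naive illustration: split a multi-part prompt into candidate subtasks."""
    # Split on clause separators such as ", then" or ", and" or sentence breaks.
    parts = re.split(r",\s*then\s+|,\s*and\s+|\.\s+", prompt)
    return [p.strip() for p in parts if p.strip()]

prompt = ("Summarize the key points of attention mechanisms, then calculate "
          "15% of 340, and translate hello world to French.")
for subtask in decompose(prompt):
    print(subtask)  # each subtask would be scored and routed to its own tier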

Complexity Scoring

The router evaluates query complexity using multiple signals:

Signal                  Low Complexity               High Complexity
Word count              Short, direct questions      Long, multi-part prompts
Technical vocabulary    Common words                 Domain-specific jargon
Code indicators         No code references           Code snippets, debugging
Multi-turn depth        First message                Deep conversation history
System prompt           None or simple               Complex instructions

These signals produce a complexity score between 0.0 and 1.0. Configurable thresholds map scores to tiers:

octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --route-strategy complexity
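
The exact weights and thresholds are not documented here, but the shape of the computation can be sketched as a weighted sum of signals mapped to a tier by cutoffs. A hypothetical sketch (all weights and thresholds invented for illustration):

def complexity_score(prompt: str, history_turns: int = 0, has_system_prompt: bool = False) -> float:
    """Toy complexity score in [0.0, 1.0] built from simple signals."""
    words = prompt.split()
    score = 0.0
    score += min(len(words) / 200, 0.4)                            # word count
    score += 0.2 if "```" in prompt or "def " in prompt else 0.0   # code indicators
    score += min(history_turns * 0.05, 0.2)                        # multi-turn depth
    score += 0.2 if has_system_prompt else 0.0                     # system prompt
    return min(score, 1.0)

def pick_tier(score: float) -> int:
    """Map a score to a tier using example thresholds."""
    if score < 0.3:
        return 1   # fast
    if score < 0.7:
        return 2   # balanced
    return 3       # quality

print(pick_tier(complexity_score("What is 2+2?")))  # 1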

Routing Statistics

Monitor routing distribution via the stats endpoint:

curl http://localhost:8080/v1/routing/stats
{
"total_requests": 15234,
"routing_distribution": {
"tier0_deterministic": {"count": 2104, "percentage": 13.8},
"smollm-360m": {"count": 7738, "percentage": 50.8},
"phi-4-mini": {"count": 3811, "percentage": 25.0},
"llama-3.2-3b": {"count": 1581, "percentage": 10.4}
},
"avg_routing_latency_us": 35,
"estimated_savings": {
"vs_largest_model_only": "64% fewer compute-seconds",
"avg_latency_reduction_ms": 82
}
}

In typical workloads, 60-80% of queries are handled by the smallest model or Tier 0. The routing overhead is negligible (under 50 microseconds per request).

Fallback Chain

If the routed model produces a low-confidence response, Octomil can escalate to the next tier automatically. Enable this with the fallback flag:

octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --fallback

The escalation chain works bottom-up: Tier 1 to Tier 2 to Tier 3. If Tier 1 handles the query successfully, no escalation occurs. The client sees a single response; the escalation is invisible.

Escalation adds latency (the failed attempt plus the retry), so it is disabled by default. Enable it when quality matters more than latency predictability.
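
Conceptually, the escalation loop looks something like the sketch below; the generate() helper, confidence signal, and threshold are hypothetical stand-ins, not Octomil's actual internals.

TIERS = ["smollm-360m", "phi-4-mini", "llama-3.2-3b"]
CONFIDENCE_THRESHOLD = 0.6  # invented for illustration

def answer_with_fallback(prompt, generate):
    """Try each tier bottom-up until a response clears the confidence bar."""
    text = ""
    for model in TIERS:
        text, confidence = generate(model, prompt)  # generate() is a hypothetical helper
        if confidence >= CONFIDENCE_THRESHOLD:
            return text   # most queries stop at the first tier; no escalation
    return text           # the highest tier's answer is returned regardless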

Server-Side API: Routing Engine

The routing engine is also available as a server-side API for applications that need to make routing decisions programmatically. This is useful when your SDKs or devices need to decide where to run inference.

POST /api/v1/route

Request a routing decision for a model and device combination:

curl -X POST https://api.octomil.com/api/v1/route \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"model_id": "text-classifier",
"model_params": 500000000,
"model_size_mb": 250,
"device_capabilities": {
"platform": "ios",
"total_memory_mb": 8192,
"gpu_available": true,
"npu_available": true
},
"prefer": "device"
}'

Response:

{
"id": "rd_a1b2c3d4",
"target": "device",
"format": "coreml",
"engine": "ane",
"quantization": "int4",
"reason": "Device is capable, routing to on-device execution. Device tier=flagship, model=500,000,000 params / 250MB",
"fallback_target": {
"target": "cloud",
"endpoint": "/api/v1/inference",
"format": "onnx",
"engine": "onnx_runtime"
},
"device_class": "flagship",
"estimated_device_latency_ms": 75.0,
"estimated_cloud_latency_ms": 50.0,
"routing_latency_us": 28
}

The prefer field accepts "device", "cloud", "cheapest", or "fastest". Every response includes a cloud fallback configuration so your SDK can recover if the device can't handle the model at runtime.
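
From application code, the same call can be made with any HTTP client. A sketch in Python using requests, reusing the payload from the example above (the token and capability values are placeholders):

import requests

decision = requests.post(
    "https://api.octomil.com/api/v1/route",
    headers={"Authorization": "Bearer <token>"},
    json={
        "model_id": "text-classifier",
        "model_params": 500_000_000,
        "model_size_mb": 250,
        "device_capabilities": {
            "platform": "ios",
            "total_memory_mb": 8192,
            "gpu_available": True,
            "npu_available": True,
        },
        "prefer": "device",
    },
).json()

print("routing target:", decision["target"], "| reason:", decision["reason"])
if decision["target"] == "device":
    # Load the suggested on-device artifact (format/engine/quantization come from the decision)
    print("device plan:", decision["format"], decision["engine"], decision["quantization"])
    # Keep the cloud fallback handy in case on-device execution fails at runtime
    print("fallback endpoint:", decision["fallback_target"]["endpoint"])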

GET /api/v1/route/stats/{model_id}

View aggregated routing statistics for a specific model across your fleet:

curl -H "Authorization: Bearer <token>" \
https://api.octomil.com/api/v1/route/stats/text-classifier
{
"model_id": "text-classifier",
"total_decisions": 8421,
"device_count": 7103,
"cloud_count": 1318,
"device_pct": 84.3,
"cloud_pct": 15.7,
"common_reasons": [
{"reason": "Device is capable, routing to on-device execution", "count": 7103},
{"reason": "Device RAM insufficient for safe inference", "count": 1318}
],
"avg_routing_latency_us": 31
}

Dashboard

Routing distribution is visible in the Monitoring Dashboard. The telemetry section shows:

  • Requests per model tier over time
  • Routing decision breakdown (device vs cloud, per-model)
  • Escalation frequency (if fallback is enabled)
  • Average routing latency

Gotchas

  • Routing adds latency — routing decisions take ~30μs per request. Negligible for most workloads, but if you're already using a single model, --auto-route adds overhead with no benefit.
  • Complexity scoring is heuristic — the router uses token count, keyword detection, and query structure. It's not perfect. Override by setting the model field to a specific model name when you know better.
  • All models must fit in memory: the --models flag loads all specified models at startup. Each model consumes VRAM/RAM. Check your available memory before loading 3+ models.
  • Tier-0 cache is per-server — if you run multiple serve instances, each has its own Tier-0 cache. There's no shared cache between instances.