Model Routing

Octomil can serve multiple models simultaneously and route each request to the most efficient model capable of handling it. Simple queries go to a small, fast model. Complex queries escalate to larger models. You save compute and latency on the majority of traffic without sacrificing quality on hard queries.

The Problem

Running a 7B model for "What time is it in Tokyo?" wastes resources. But you can't just deploy a small model for everything -- it fails on complex reasoning tasks. Manually tagging query complexity is brittle and doesn't scale.

Octomil analyzes each incoming query and routes it to the smallest model that can produce a quality response. No manual rules, no classification labels.

Quick Start

octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route

Startup output shows the tier assignments:

[routing] Loading 3 models:
Tier 1 (fast): smollm-360m (360M params, ~2 tok/ms)
Tier 2 (balanced): phi-4-mini (3.8B params, ~0.8 tok/ms)
Tier 3 (quality): llama-3.2-3b (3B params, ~0.6 tok/ms)
[routing] Auto-routing enabled. Queries routed by complexity.
[routing] Tier-0 (deterministic): arithmetic, unit conversions
[serve] Listening on http://localhost:8080

Send requests to the same endpoint as usual -- routing is transparent:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
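The same request can be issued from Python with only the standard library. This is a minimal sketch against the local server started above; `build_chat_request` is an illustrative helper name, not part of any Octomil SDK:

```python
import json
import urllib.request

def build_chat_request(content, model="auto", base_url="http://localhost:8080"):
    """Build an OpenAI-compatible chat request for the routed endpoint.

    model="auto" lets Octomil pick the tier; a concrete model name pins it.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("What is 2+2?")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
```

Send it with `urllib.request.urlopen(req)` once the server is running.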

Routing Decisions

The response includes headers showing which model handled the request:

X-Octomil-Routed-Model: smollm-360m
X-Octomil-Routing-Tier: 1
X-Octomil-Routing-Latency-Us: 42

For the same server, a harder query routes to a larger model:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices."}]
}'
X-Octomil-Routed-Model: llama-3.2-3b
X-Octomil-Routing-Tier: 3
X-Octomil-Routing-Latency-Us: 38
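If you log routing decisions client-side, a small helper can turn these headers into structured data. `routing_info` is an illustrative name (not an Octomil SDK function); it assumes a plain dict of the `X-Octomil-*` headers shown above:

```python
def routing_info(headers):
    """Extract Octomil routing metadata from response headers."""
    return {
        "model": headers.get("X-Octomil-Routed-Model"),
        "tier": int(headers.get("X-Octomil-Routing-Tier", 0)),
        "latency_us": int(headers.get("X-Octomil-Routing-Latency-Us", 0)),
    }

info = routing_info({
    "X-Octomil-Routed-Model": "llama-3.2-3b",
    "X-Octomil-Routing-Tier": "3",
    "X-Octomil-Routing-Latency-Us": "38",
})
print(info)  # {'model': 'llama-3.2-3b', 'tier': 3, 'latency_us': 38}
```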

Requesting a Specific Model

You can bypass routing by specifying a model name directly:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Hello"}]
}'

When model is set to "auto", routing applies. When set to a specific model name, the request goes directly to that model.

Tier-0: Instant Answers

Simple arithmetic and unit conversions are answered instantly without invoking any model. This is the fastest tier -- responses arrive in under 1 millisecond.

> "What is 15% of 240?"
< "36.0" (0.2ms, no model used)

> "Convert 72°F to Celsius"
< "22.22°C" (0.1ms, no model used)

Tier-0 uses safe AST-based evaluation (no eval). It handles basic math, percentages, unit conversions, and simple expressions. Anything it can't solve deterministically is passed to Tier 1+.

Tier-0 is enabled automatically when --auto-route is set. No additional configuration needed.
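Octomil's Tier-0 evaluator isn't published, but the AST-based approach can be sketched in a few lines of Python: parse the expression, then walk the tree allowing only numeric literals and a whitelist of operators, so arbitrary code can never execute. This is a minimal illustration of the idea, not the actual implementation:

```python
import ast
import operator

# Whitelisted operators -- any node type outside this table is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr):
    """Evaluate a pure-arithmetic expression, or raise ValueError."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("not a deterministic arithmetic expression")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("15 * 240 / 100"))  # 36.0
```

Anything that parses to a function call, attribute access, or name lookup hits the `ValueError` branch, which is exactly the "pass to Tier 1+" signal.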

Query Decomposition

Multi-part queries are automatically detected and split into subtasks. Each subtask is routed to the appropriate tier independently, and results are merged into a single response.

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French."}]
}'

Response headers show how the query was decomposed:

X-Octomil-Decomposed: true
X-Octomil-Subtasks: 3
X-Octomil-Subtask-Models: llama-3.2-3b,tier0,smollm-360m

In this example:

  • "Summarize attention mechanisms" → Tier 3 (llama-3.2-3b, complex reasoning)
  • "Calculate 15% of 340" → Tier 0 (instant arithmetic, no model)
  • "Translate hello world to French" → Tier 1 (smollm-360m, simple task)

The client sees a single merged response. Decomposition is transparent.
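For observability, the decomposition headers can be paired back into an ordered subtask-to-model list. A small sketch (`subtask_assignments` is an illustrative name, assuming the headers arrive as a plain dict of strings):

```python
def subtask_assignments(headers):
    """Return the per-subtask model list, or [] if the query wasn't split."""
    if headers.get("X-Octomil-Decomposed") != "true":
        return []
    return headers["X-Octomil-Subtask-Models"].split(",")

models = subtask_assignments({
    "X-Octomil-Decomposed": "true",
    "X-Octomil-Subtasks": "3",
    "X-Octomil-Subtask-Models": "llama-3.2-3b,tier0,smollm-360m",
})
print(models)  # ['llama-3.2-3b', 'tier0', 'smollm-360m']
```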

Complexity Scoring

The router evaluates query complexity using multiple signals:

| Signal | Low complexity | High complexity |
| --- | --- | --- |
| Word count | Short, direct questions | Long, multi-part prompts |
| Technical vocabulary | Common words | Domain-specific jargon |
| Code indicators | No code references | Code snippets, debugging |
| Multi-turn depth | First message | Deep conversation history |
| System prompt | None or simple | Complex instructions |

These signals produce a complexity score between 0.0 and 1.0. Configurable thresholds map scores to tiers:

octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --route-strategy complexity
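The exact scoring model is internal to Octomil; the sketch below only illustrates the shape of signal-weighted scoring with configurable tier thresholds. The weights, the jargon list, and both function names are invented for illustration:

```python
def score_complexity(messages, system_prompt=""):
    """Toy complexity score in [0, 1] built from the signals in the table above."""
    text = " ".join(m["content"] for m in messages)
    words = text.split()
    jargon = {"tradeoffs", "non-IID", "heterogeneous", "gradient", "asymptotic"}
    score = 0.0
    score += min(len(words) / 100, 0.4)                         # word count
    score += 0.2 * any(w.strip(".,") in jargon for w in words)  # vocabulary
    score += 0.2 * ("```" in text or "def " in text)            # code indicators
    score += min(len(messages) / 20, 0.1)                       # multi-turn depth
    score += 0.1 * (len(system_prompt.split()) > 20)            # system prompt
    return min(score, 1.0)

def pick_tier(score, thresholds=(0.25, 0.6)):
    """Map a score to tier 1, 2, or 3 via two configurable cut points."""
    return 1 + sum(score >= t for t in thresholds)

simple = [{"role": "user", "content": "What time is it in Tokyo?"}]
print(pick_tier(score_complexity(simple)))  # 1
```

Raising the cut points biases traffic toward the small model; lowering them trades compute for quality.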

Routing Statistics

Monitor routing distribution via the stats endpoint:

curl http://localhost:8080/v1/routing/stats
{
"total_requests": 15234,
"routing_distribution": {
"tier0_deterministic": {"count": 2104, "percentage": 13.8},
"smollm-360m": {"count": 7738, "percentage": 50.8},
"phi-4-mini": {"count": 3811, "percentage": 25.0},
"llama-3.2-3b": {"count": 1581, "percentage": 10.4}
},
"avg_routing_latency_us": 35,
"estimated_savings": {
"vs_largest_model_only": "64% fewer compute-seconds",
"avg_latency_reduction_ms": 82
}
}

In typical workloads, 60-80% of queries are handled by the smallest model or Tier 0. The routing overhead is negligible (under 50 microseconds per request).
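To check your own workload against that 60-80% figure, a small helper over the stats payload is enough. `small_model_share` is a hypothetical helper (standard library only), shown against the sample response above:

```python
def small_model_share(stats, small=("tier0_deterministic", "smollm-360m")):
    """Fraction of requests served by Tier 0 or the smallest model."""
    dist = stats["routing_distribution"]
    handled = sum(v["count"] for k, v in dist.items() if k in small)
    return handled / stats["total_requests"]

stats = {
    "total_requests": 15234,
    "routing_distribution": {
        "tier0_deterministic": {"count": 2104},
        "smollm-360m": {"count": 7738},
        "phi-4-mini": {"count": 3811},
        "llama-3.2-3b": {"count": 1581},
    },
}
print(round(small_model_share(stats), 3))  # 0.646
```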

Fallback Chain

If the routed model produces a low-confidence response, Octomil can escalate to the next tier automatically. Enable this with the fallback flag:

octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --fallback

The escalation chain works bottom-up: Tier 1 to Tier 2 to Tier 3. If Tier 1 handles the query successfully, no escalation occurs. The client sees a single response -- the escalation is invisible.

Escalation adds latency (the failed attempt plus the retry), so it is disabled by default. Enable it when quality matters more than latency predictability.
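Conceptually, the chain walks the tiers bottom-up until one answer clears a confidence bar. This sketch assumes each tier exposes a callable returning `(text, confidence)`; the real confidence signal and threshold are internal to Octomil:

```python
def answer_with_fallback(query, tiers, confidence_threshold=0.7):
    """Escalate through ordered tiers until a response is confident enough.

    tiers: list of callables, smallest model first, each -> (text, confidence).
    """
    result, conf = None, 0.0
    for run_model in tiers:
        result, conf = run_model(query)
        if conf >= confidence_threshold:
            return result   # confident enough; no escalation
    return result           # top tier's answer, confident or not

# Stub models: tier 1 is unsure, tier 2 is confident.
tier1 = lambda q: ("maybe?", 0.3)
tier2 = lambda q: ("definitely", 0.9)
print(answer_with_fallback("hard query", [tier1, tier2]))  # definitely
```

Note the worst case runs every tier once, which is why the flag is off by default.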

Server-Side API: Routing Engine

The routing engine is also available as a server-side API for applications that need to make routing decisions programmatically. This is useful when your SDKs or devices need to decide where to run inference.

POST /api/v1/route

Request a routing decision for a model and device combination:

curl -X POST https://api.octomil.com/api/v1/route \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"model_id": "text-classifier",
"model_params": 500000000,
"model_size_mb": 250,
"device_capabilities": {
"platform": "ios",
"total_memory_mb": 8192,
"gpu_available": true,
"npu_available": true
},
"prefer": "device"
}'

Response:

{
"id": "rd_a1b2c3d4",
"target": "device",
"format": "coreml",
"engine": "ane",
"quantization": "int4",
"reason": "Device is capable, routing to on-device execution. Device tier=flagship, model=500,000,000 params / 250MB",
"fallback_target": {
"target": "cloud",
"endpoint": "/api/v1/inference",
"format": "onnx",
"engine": "onnx_runtime"
},
"device_tier": "flagship",
"estimated_device_latency_ms": 75.0,
"estimated_cloud_latency_ms": 50.0,
"routing_latency_us": 28
}

The prefer field accepts "device", "cloud", "cheapest", or "fastest". Every response includes a cloud fallback configuration so your SDK can recover if the device can't handle the model at runtime.
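Consuming a decision in an SDK then reduces to: run on the primary target, and switch to the cloud fallback if on-device setup fails at runtime. A sketch over the response body shown above (`execution_plan` is an illustrative name):

```python
def execution_plan(decision, device_ok=True):
    """Choose the execution target from a /api/v1/route decision.

    decision: parsed JSON response; device_ok: whether the SDK actually
    managed to load the model on-device at runtime.
    """
    if decision["target"] == "device" and device_ok:
        return {"target": "device",
                "format": decision["format"],
                "engine": decision["engine"]}
    fb = decision["fallback_target"]
    return {"target": fb["target"], "format": fb["format"],
            "engine": fb["engine"], "endpoint": fb["endpoint"]}

decision = {
    "target": "device", "format": "coreml", "engine": "ane",
    "fallback_target": {"target": "cloud", "endpoint": "/api/v1/inference",
                        "format": "onnx", "engine": "onnx_runtime"},
}
print(execution_plan(decision)["engine"])                   # ane
print(execution_plan(decision, device_ok=False)["target"])  # cloud
```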

GET /api/v1/route/stats/{model_id}

View aggregated routing statistics for a specific model across your fleet:

curl -H "Authorization: Bearer <token>" \
https://api.octomil.com/api/v1/route/stats/text-classifier
{
"model_id": "text-classifier",
"total_decisions": 8421,
"device_count": 7103,
"cloud_count": 1318,
"device_pct": 84.3,
"cloud_pct": 15.7,
"common_reasons": [
{"reason": "Device is capable, routing to on-device execution", "count": 7103},
{"reason": "Device RAM insufficient for safe inference", "count": 1318}
],
"avg_routing_latency_us": 31
}

Dashboard

Routing distribution is visible in the Monitoring Dashboard. The telemetry section shows:

  • Requests per model tier over time
  • Routing decision breakdown (device vs cloud, per-model)
  • Escalation frequency (if fallback is enabled)
  • Average routing latency

Gotchas

  • Routing adds latency — routing decisions take ~30μs per request. Negligible for most workloads, but if you're already using a single model, --auto-route adds overhead with no benefit.
  • Complexity scoring is heuristic — the router uses token count, keyword detection, and query structure. It's not perfect. Override it by setting "model" to a specific model name when you know better.
  • All models must fit in memory — --models loads all specified models at startup. Each model consumes VRAM/RAM. Check your available memory before loading 3+ models.
  • Tier-0 cache is per-server — if you run multiple serve instances, each has its own Tier-0 cache. There's no shared cache between instances.