Model Routing
Octomil can serve multiple models simultaneously and route each request to the most efficient model that can handle it. Simple queries go to a smaller model. Harder queries escalate to larger or higher-quality models. The result is lower average latency and lower compute cost without sacrificing quality on harder prompts.
The Problem
Running a large model for "What time is it in Tokyo?" wastes resources. But using a small model for everything can hurt quality on harder tasks. Manually tagging prompts by complexity is brittle and does not scale.
Octomil analyzes each query and routes it to an appropriate tier. You can keep the defaults or tune the routing policy for your workload.
Quick Start
octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route
Startup output shows the tier assignments:
[routing] Loading 3 models:
Tier 1 (fast): smollm-360m (360M params, ~2 tok/ms)
Tier 2 (balanced): phi-4-mini (3.8B params, ~0.8 tok/ms)
Tier 3 (quality): llama-3.2-3b (3B params, ~0.6 tok/ms)
[routing] Auto-routing enabled. Queries routed by complexity.
[routing] Tier-0 (deterministic): arithmetic, unit conversions
[serve] Listening on http://localhost:8080
Send requests to the same endpoint as usual. Routing stays transparent to the client:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "auto",
messages: [{ role: "user", content: "What is 2+2?" }],
});
console.log(response.choices[0].message.content);
Routing Decisions
The response includes headers showing which model handled the request:
X-Octomil-Routed-Model: smollm-360m
X-Octomil-Routing-Tier: 1
X-Octomil-Routing-Latency-Us: 42
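If you want to act on these headers programmatically, a small helper can parse them. The sample dict below mirrors the headers shown above; with the OpenAI Python SDK you can obtain raw response headers via `client.chat.completions.with_raw_response.create(...)`.

```python
# Parse Octomil routing headers into (model, tier, latency_us).
# The sample values mirror the response headers shown above.
def routing_info(headers: dict) -> tuple:
    return (
        headers["X-Octomil-Routed-Model"],
        int(headers["X-Octomil-Routing-Tier"]),
        int(headers["X-Octomil-Routing-Latency-Us"]),
    )

model, tier, latency_us = routing_info({
    "X-Octomil-Routed-Model": "smollm-360m",
    "X-Octomil-Routing-Tier": "1",
    "X-Octomil-Routing-Latency-Us": "42",
})
print(model, tier, latency_us)  # → smollm-360m 1 42
```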
For the same server, a harder query routes to a larger model:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices."}]
}'
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices."}],
)
print(response.choices[0].message.content)
const response = await client.chat.completions.create({
model: "auto",
messages: [{ role: "user", content: "Explain the tradeoffs between FedAvg and FedProx for non-IID data distributions across heterogeneous edge devices." }],
});
console.log(response.choices[0].message.content);
X-Octomil-Routed-Model: llama-3.2-3b
X-Octomil-Routing-Tier: 3
X-Octomil-Routing-Latency-Us: 38
Requesting a Specific Model
You can bypass routing by specifying a model name directly:
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-4-mini",
"messages": [{"role": "user", "content": "Hello"}]
}'
response = client.chat.completions.create(
model="phi-4-mini",
messages=[{"role": "user", "content": "Hello"}],
)
const response = await client.chat.completions.create({
model: "phi-4-mini",
messages: [{ role: "user", content: "Hello" }],
});
When `model` is set to `"auto"`, routing applies. When it is set to a specific model name, the request goes directly to that model.
Tier-0: Instant Answers
Simple arithmetic and unit conversions can be answered deterministically without invoking a model. This is the fastest tier and avoids model usage for queries that do not need it.
> "What is 15% of 240?"
< "36.0" (0.2ms, no model used)
> "Convert 72°F to Celsius"
< "22.22°C" (0.1ms, no model used)
Tier-0 uses safe AST-based evaluation rather than `eval()`. It handles basic math, percentages, unit conversions, and simple expressions. Anything it cannot solve deterministically is passed to Tier 1 or higher.
Tier-0 is enabled automatically when `--auto-route` is set. No additional configuration is needed.
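The AST-based approach can be illustrated with a minimal sketch: walk the parsed expression tree and apply only a whitelist of operators, never calling `eval()`. This is an illustration of the technique, not Octomil's actual implementation (which also handles percentages and unit conversions).

```python
# Minimal sketch of Tier-0-style safe arithmetic: walk the AST and
# apply whitelisted operators only. Illustrative, not Octomil's code.
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("0.15 * 240"))  # → 36.0
```

Anything the walker does not recognize raises, which is the cue to fall through to Tier 1.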
Query Decomposition
Multi-part queries are automatically detected and split into subtasks. Each subtask is routed to the appropriate tier independently, and results are merged into a single response.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French."}]
}'
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French."}],
)
print(response.choices[0].message.content)
const response = await client.chat.completions.create({
model: "auto",
messages: [{ role: "user", content: "Summarize the key points of attention mechanisms, then calculate 15% of 340, and translate hello world to French." }],
});
console.log(response.choices[0].message.content);
Response headers show how the query was decomposed:
X-Octomil-Decomposed: true
X-Octomil-Subtasks: 3
X-Octomil-Subtask-Models: llama-3.2-3b,tier0,smollm-360m
In this example:
- "Summarize attention mechanisms" → Tier 3 (llama-3.2-3b, complex reasoning)
- "Calculate 15% of 340" → Tier 0 (instant arithmetic, no model)
- "Translate hello world to French" → Tier 1 (smollm-360m, simple task)
The client sees a single merged response. Decomposition is transparent.
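To make the decomposition step concrete, here is a deliberately naive sketch that splits a query on sequencing connectives. Octomil's actual decomposer is more sophisticated; this only illustrates the idea of turning one prompt into independently routable subtasks.

```python
# Naive sketch of multi-part query detection: split on sequencing
# connectives like ", then" and ", and". Illustrative only.
import re

def decompose(query: str) -> list:
    parts = re.split(r",\s*then\s+|,\s*and\s+", query)
    return [p.strip().rstrip(".") for p in parts if p.strip()]

subtasks = decompose(
    "Summarize the key points of attention mechanisms, "
    "then calculate 15% of 340, and translate hello world to French."
)
print(len(subtasks))  # → 3
```

Each element of `subtasks` would then be scored and routed on its own, matching the three-subtask breakdown shown above.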
Complexity Scoring
The router evaluates query complexity using multiple signals:
| Signal | Low Complexity | High Complexity |
|---|---|---|
| Word count | Short, direct questions | Long, multi-part prompts |
| Technical vocabulary | Common words | Domain-specific jargon |
| Code indicators | No code references | Code snippets, debugging |
| Multi-turn depth | First message | Deep conversation history |
| System prompt | None or simple | Complex instructions |
These signals produce a complexity score between 0.0 and 1.0. Configurable thresholds map scores to tiers:
octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --route-strategy complexity
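The score-to-tier mapping can be sketched as a simple threshold lookup. The cutoff values below are hypothetical, chosen for illustration; they are not Octomil's defaults.

```python
# Map a 0.0-1.0 complexity score to a tier via configurable cutoffs.
# The threshold values here are made up for illustration.
def score_to_tier(score: float, thresholds: tuple = (0.25, 0.6)) -> int:
    for tier, cutoff in enumerate(thresholds, start=1):
        if score < cutoff:
            return tier
    return len(thresholds) + 1  # top tier

print(score_to_tier(0.1), score_to_tier(0.4), score_to_tier(0.9))  # → 1 2 3
```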
Routing Statistics
Monitor routing distribution via the stats endpoint:
curl http://localhost:8080/v1/routing/stats
{
"total_requests": 15234,
"routing_distribution": {
"tier0_deterministic": {"count": 2104, "percentage": 13.8},
"smollm-360m": {"count": 7738, "percentage": 50.8},
"phi-4-mini": {"count": 3811, "percentage": 25.0},
"llama-3.2-3b": {"count": 1581, "percentage": 10.4}
},
"avg_routing_latency_us": 35,
"estimated_savings": {
"vs_largest_model_only": "64% fewer compute-seconds",
"avg_latency_reduction_ms": 82
}
}
In typical workloads, 60-80% of queries are handled by the smallest model or Tier 0. The routing overhead is negligible (under 50 microseconds per request).
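As a quick check against that 60-80% figure, you can compute the cheap-path fraction directly from the stats payload. The `stats` dict below mirrors the sample response above; in practice you would fetch it with `requests.get("http://localhost:8080/v1/routing/stats").json()`.

```python
# Summarize a /v1/routing/stats payload: what fraction of requests
# never touched a model larger than the smallest tier?
stats = {
    "total_requests": 15234,
    "routing_distribution": {
        "tier0_deterministic": {"count": 2104, "percentage": 13.8},
        "smollm-360m": {"count": 7738, "percentage": 50.8},
        "phi-4-mini": {"count": 3811, "percentage": 25.0},
        "llama-3.2-3b": {"count": 1581, "percentage": 10.4},
    },
}

def cheap_fraction(stats: dict) -> float:
    """Fraction of requests handled by Tier 0 or the smallest model."""
    dist = stats["routing_distribution"]
    cheap = dist["tier0_deterministic"]["count"] + dist["smollm-360m"]["count"]
    return cheap / stats["total_requests"]

print(f"{cheap_fraction(stats):.1%}")  # → 64.6%
```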
Fallback Chain
If the routed model produces a low-confidence response, Octomil can escalate to the next tier automatically. Enable this with the fallback flag:
octomil serve --models smollm-360m,phi-4-mini,llama-3.2-3b --auto-route --fallback
The escalation chain works bottom-up: Tier 1 to Tier 2 to Tier 3. If Tier 1 handles the query successfully, no escalation occurs. The client sees a single response; the escalation is invisible.
Escalation adds latency (the failed attempt plus the retry), so it is disabled by default. Enable it when quality matters more than latency predictability.
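The escalation loop can be sketched as follows. How Octomil measures confidence and what threshold it uses are not documented here, so both are assumptions; the tier functions are hypothetical stand-ins for model calls.

```python
# Sketch of the bottom-up fallback chain. The confidence threshold and
# the (answer, confidence) tier interface are assumptions for illustration.
def answer_with_fallback(query, tiers, min_confidence=0.7):
    """Try each tier in order; escalate while confidence is low."""
    for tier in tiers:
        text, confidence = tier(query)
        if confidence >= min_confidence:
            return text
    return text  # last tier's answer, even if still low-confidence

# Hypothetical tier functions returning (answer, confidence):
tier1 = lambda q: ("small-model answer", 0.4)
tier2 = lambda q: ("mid-model answer", 0.9)
print(answer_with_fallback("hard question", [tier1, tier2]))  # → mid-model answer
```

The latency cost noted above is visible in the structure: a low-confidence Tier 1 attempt runs to completion before the Tier 2 retry starts.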
Server-Side API: Routing Engine
The routing engine is also available as a server-side API for applications that need to make routing decisions programmatically. This is useful when your SDKs or devices need to decide where to run inference.
POST /api/v1/route
Request a routing decision for a model and device combination:
- cURL
- Python
- JavaScript
curl -X POST https://api.octomil.com/api/v1/route \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"model_id": "text-classifier",
"model_params": 500000000,
"model_size_mb": 250,
"device_capabilities": {
"platform": "ios",
"total_memory_mb": 8192,
"gpu_available": true,
"npu_available": true
},
"prefer": "device"
}'
import requests
response = requests.post(
"https://api.octomil.com/api/v1/route",
headers={"Authorization": "Bearer <token>"},
json={
"model_id": "text-classifier",
"model_params": 500000000,
"model_size_mb": 250,
"device_capabilities": {
"platform": "ios",
"total_memory_mb": 8192,
"gpu_available": True,
"npu_available": True,
},
"prefer": "device",
},
)
print(response.json())
const response = await fetch("https://api.octomil.com/api/v1/route", {
method: "POST",
headers: {
"Authorization": "Bearer <token>",
"Content-Type": "application/json",
},
body: JSON.stringify({
model_id: "text-classifier",
model_params: 500000000,
model_size_mb: 250,
device_capabilities: {
platform: "ios",
total_memory_mb: 8192,
gpu_available: true,
npu_available: true,
},
prefer: "device",
}),
});
const data = await response.json();
console.log(data);
Response:
{
"id": "rd_a1b2c3d4",
"target": "device",
"format": "coreml",
"engine": "ane",
"quantization": "int4",
"reason": "Device is capable, routing to on-device execution. Device tier=flagship, model=500,000,000 params / 250MB",
"fallback_target": {
"target": "cloud",
"endpoint": "/api/v1/inference",
"format": "onnx",
"engine": "onnx_runtime"
},
"device_class": "flagship",
"estimated_device_latency_ms": 75.0,
"estimated_cloud_latency_ms": 50.0,
"routing_latency_us": 28
}
The `prefer` field accepts `"device"`, `"cloud"`, `"cheapest"`, or `"fastest"`. Every response includes a cloud fallback configuration so your SDK can recover if the device can't handle the model at runtime.
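Client-side, acting on a routing decision looks roughly like this sketch: try the recommended target, and fall back to the included cloud configuration on failure. `run_on_device` and `run_in_cloud` are hypothetical callables standing in for your SDK's execution paths.

```python
# Sketch of SDK-side handling of a /api/v1/route decision.
# run_on_device / run_in_cloud are hypothetical execution callables.
def execute(decision: dict, run_on_device, run_in_cloud):
    """Run inference per a routing decision, with cloud fallback."""
    fb = decision["fallback_target"]
    if decision["target"] == "device":
        try:
            return run_on_device(decision["format"], decision["engine"])
        except RuntimeError:
            pass  # device couldn't handle it at runtime; fall through
    return run_in_cloud(fb["endpoint"], fb["format"])

# Demo with the sample decision above and a device path that fails:
decision = {
    "target": "device", "format": "coreml", "engine": "ane",
    "fallback_target": {"target": "cloud", "endpoint": "/api/v1/inference",
                        "format": "onnx", "engine": "onnx_runtime"},
}

def broken_device(fmt, engine):
    raise RuntimeError("model too large for available RAM")

result = execute(decision, broken_device, lambda ep, fmt: f"cloud:{ep}:{fmt}")
print(result)  # → cloud:/api/v1/inference:onnx
```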
GET /api/v1/route/stats/{model_id}
View aggregated routing statistics for a specific model across your fleet:
curl -H "Authorization: Bearer <token>" \
https://api.octomil.com/api/v1/route/stats/text-classifier
{
"model_id": "text-classifier",
"total_decisions": 8421,
"device_count": 7103,
"cloud_count": 1318,
"device_pct": 84.3,
"cloud_pct": 15.7,
"common_reasons": [
{"reason": "Device is capable, routing to on-device execution", "count": 7103},
{"reason": "Device RAM insufficient for safe inference", "count": 1318}
],
"avg_routing_latency_us": 31
}
Dashboard
Routing distribution is visible in the Monitoring Dashboard. The telemetry section shows:
- Requests per model tier over time
- Routing decision breakdown (device vs cloud, per-model)
- Escalation frequency (if fallback is enabled)
- Average routing latency
Gotchas
- Routing adds latency: routing decisions take ~30μs per request. This is negligible for most workloads, but if you're already using a single model, `--auto-route` adds overhead with no benefit.
- Complexity scoring is heuristic: the router uses token count, keyword detection, and query structure. It's not perfect. Override it with the `X-Model` header when you know better.
- All models must fit in memory: `--models` loads all specified models at startup, and each model consumes VRAM/RAM. Check your available memory before loading 3+ models.
- Tier-0 cache is per-server: if you run multiple serve instances, each has its own Tier-0 cache. There is no shared cache between instances.
Related
- Local Inference — server setup
- Early Exit — skip unnecessary transformer layers
- Device Profiling — benchmark data powering routing decisions
- Observability — monitor routing metrics
- Device Targeting — deployment recommendations