# Speculative Decoding
Octomil automatically accelerates token generation by 1.5-3x on capable hardware. When you run `octomil serve`, speculative decoding activates if the device has enough memory -- no flags, no configuration, no draft model selection. You get faster output with identical quality.
## Quick Start

```bash
octomil serve phi-4-mini
```
That's it. If your device has 6 GB+ of RAM, a draft model is loaded and speculative decoding enables automatically (devices with 4-6 GB use a lighter pattern-based mode; see Device Requirements below). The startup output confirms it:

```text
[engine] Selected: mlx (fastest)
[speculative] Enabled: draft model loaded (452 MB)
[speculative] Method: draft-verify, lookahead: 7, adaptive: on
[serve] Listening on http://localhost:8080
```
## Before and After
Standard decoding generates one token per forward pass of the main model. Speculative decoding uses a small draft model to propose several tokens ahead, then verifies the whole batch with the main model in a single pass -- producing the same output faster.
| Mode | Throughput (phi-4-mini, M3 Pro) | Time for 200 tokens |
|---|---|---|
| Standard | 18 tok/s | 11.1s |
| Speculative (auto) | 44 tok/s | 4.5s |
The speedup varies by prompt. Predictable text (common phrases, structured output) sees higher gains; novel or highly technical content sees lower gains. Octomil adapts dynamically during generation.
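The draft-verify loop benchmarked above can be sketched in a few lines. This is an illustrative greedy-decoding sketch of the general technique, not Octomil's implementation; `main_model` and `draft_model` are hypothetical stand-ins that map a token sequence to the single next token.

```python
# Illustrative greedy draft-verify loop -- the general technique, not
# Octomil's implementation. `main_model` and `draft_model` are hypothetical
# stand-ins: each maps a token sequence to the next token.

def speculative_generate(main_model, draft_model, prompt, n_tokens, lookahead=7):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. The cheap draft model proposes `lookahead` tokens.
        draft = []
        for _ in range(lookahead):
            draft.append(draft_model(out + draft))
        # 2. The main model verifies the proposals (a single batched pass
        # in a real implementation).
        accepted = 0
        for i in range(lookahead):
            if main_model(out + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        # 3. The main model always contributes one token itself (the
        # correction on a mismatch), so progress is guaranteed even at
        # zero acceptance.
        out.append(main_model(out))
    return out[len(prompt):][:n_tokens]
```

Because every emitted token is either verified or generated by the main model, the output matches standard greedy decoding token for token; only the number of main-model passes changes.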
## How Speedup Is Reported
When telemetry is enabled, speculative metrics appear alongside standard inference metrics:
```bash
octomil serve phi-4-mini --api-key <your-api-key>
```
The /v1/chat/completions response includes speculative decoding headers:
```text
X-Octomil-Speculative: enabled
X-Octomil-Acceptance-Rate: 0.74
X-Octomil-Effective-Speedup: 2.3x
```
These metrics are also visible in the Monitoring Dashboard under the inference telemetry section. The dashboard displays acceptance rate trends over time, helping you understand real-world performance across your fleet.
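If you are scripting against the server, the headers above can be parsed into typed values. The header names come from this page; the helper itself is an illustrative sketch, not part of any Octomil SDK:

```python
# Parse Octomil's speculative-decoding headers into typed values.
# Header names come from this page; the helper is an illustrative sketch.

def speculative_stats(headers):
    """`headers` is any dict-like of HTTP response headers."""
    return {
        "enabled": headers.get("X-Octomil-Speculative") == "enabled",
        "acceptance_rate": float(headers.get("X-Octomil-Acceptance-Rate", "0")),
        # "2.3x" -> 2.3
        "speedup": float(headers.get("X-Octomil-Effective-Speedup", "1x").rstrip("x")),
    }
```

With the `requests` library this works directly on a response, e.g. `speculative_stats(requests.post(url, json=body).headers)`, since `Response.headers` is dict-like.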
## Device Requirements
Speculative decoding activates automatically based on available memory. No manual device checks needed.
| Available RAM | Behavior |
|---|---|
| 8 GB+ | Full speculative decoding with larger draft model. Highest acceptance rate and speedup (2.2-3x). |
| 6-8 GB | Speculative decoding with compact draft model. Good speedup (1.8-2.2x). |
| 4-6 GB | Lightweight acceleration using pattern-based prediction. No additional model loaded. Modest speedup (1.3-1.5x). |
| < 4 GB | Standard decoding. Speculative decoding disabled to preserve memory for the main model. |
Octomil evaluates available memory at startup (after loading the main model) and selects the strategy that fits. If memory pressure changes during inference, the runtime degrades gracefully.
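The tiering above amounts to a simple threshold check on post-load memory. The thresholds mirror the table; the function name and tier labels are hypothetical, not Octomil internals:

```python
# The memory tiers above as a threshold check. Function name and tier
# labels are hypothetical -- this mirrors the table, not Octomil's code.

def choose_strategy(available_gb: float) -> str:
    """Pick a decoding strategy from RAM available after the main model loads."""
    if available_gb >= 8:
        return "speculative-large-draft"    # 2.2-3x speedup
    if available_gb >= 6:
        return "speculative-compact-draft"  # 1.8-2.2x speedup
    if available_gb >= 4:
        return "pattern-based"              # 1.3-1.5x, no extra model loaded
    return "standard"                       # speculative decoding disabled
```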
## Disabling Speculative Decoding

If you need deterministic single-token generation (for debugging or benchmarking), disable it:

```bash
octomil serve phi-4-mini --speculative off
```
You can also disable it per-request by adding a header:
cURL:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Octomil-Speculative: off" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Python:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"X-Octomil-Speculative": "off"},
)
print(response.choices[0].message.content)
```
JavaScript:

```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Hello" }],
}, { headers: { "X-Octomil-Speculative": "off" } });
console.log(response.choices[0].message.content);
```
## Works with Structured Decoding
Speculative decoding and structured decoding work simultaneously. When you request JSON output via response_format, Octomil applies both optimizations -- schema-constrained token generation is still accelerated by speculative prediction.
cURL:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "List 3 ML frameworks with descriptions."}],
    "response_format": {"type": "json_object"}
  }'
```
Python:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "List 3 ML frameworks with descriptions."}],
    response_format={"type": "json_object"},
)
# Both speculative decoding AND JSON enforcement are active
print(response.choices[0].message.content)
```
JavaScript:

```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "List 3 ML frameworks with descriptions." }],
  response_format: { type: "json_object" },
});
// Both speculative decoding AND JSON enforcement are active
console.log(response.choices[0].message.content);
```
## Benchmark Command
Run a standalone benchmark to see speculative decoding's impact on your hardware:
```bash
octomil benchmark phi-4-mini --speculative on
octomil benchmark phi-4-mini --speculative off
```
Example output:

```text
Model: phi-4-mini (3.8B)
Engine: mlx
Device: Apple M3 Pro (18 GB)
Speculative ON:  44.2 tok/s (acceptance rate: 74%)
Speculative OFF: 18.1 tok/s
Speedup: 2.44x
```
Share your results to improve recommendations for other users:
```bash
octomil benchmark phi-4-mini --share
```
This contributes anonymous performance data to the device profiling system.
## Quality Guarantee
Speculative decoding is lossless. It is not an approximation: the draft model only proposes candidates, and the main model accepts or rejects them with an exact accept/reject rule, so the main model's probability distribution is preserved exactly. Sampled output follows precisely the same distribution as standard decoding, and greedy output is token-for-token identical. There is no quality-speed tradeoff.
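This guarantee rests on the standard accept/reject rule used by draft-verify methods generally. A single-token sketch over a toy vocabulary (token indices 0..n-1) shows why the main model's distribution `p` is reproduced exactly even though candidates come from the draft distribution `q` -- this illustrates the published technique, not Octomil's specific code:

```python
# The standard speculative-sampling rule for one token: accept the draft's
# proposal x with probability min(1, p[x]/q[x]); on rejection, resample from
# the normalized residual max(0, p - q). The result is distributed as p.
import random

def speculative_sample(p, q, rng):
    """Draw one token distributed exactly as p, using draft distribution q."""
    x = rng.choices(range(len(q)), weights=q)[0]   # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):       # accept with prob min(1, p/q)
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    return rng.choices(range(len(p)), weights=[r / total for r in residual])[0]
```

Over many draws the output frequencies match `p`, not `q`; a better draft model only raises the acceptance rate (and therefore speed), never introduces error.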
## Gotchas
- Full speculative decoding requires 6 GB+ RAM -- devices with 4-6 GB fall back to lightweight pattern-based acceleration, and devices below 4 GB to standard decoding. No error either way; see Device Requirements.
- Acceptance rate varies by prompt -- creative/open-ended prompts see lower acceptance rates (~50-60%) than factual/structured prompts (~70-80%). Speedup scales with acceptance rate.
- The draft model is downloaded automatically -- the first run downloads a compact draft model (a few hundred MB; 452 MB for phi-4-mini). Subsequent runs use the cached copy.
- Not all engines support it -- currently MLX and llama.cpp only. Other engines fall back to standard decoding silently.
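The acceptance-rate gotcha above has a simple idealized model: with lookahead k and an independent per-token acceptance probability a, each main-model pass yields on average 1 + a + a² + ... + a^k tokens. This is a back-of-envelope sketch under an i.i.d. assumption, not Octomil's actual accounting:

```python
# Expected tokens per main-model pass under an idealized i.i.d. acceptance
# model: the always-emitted main-model token plus each successively
# accepted draft token, i.e. the geometric series a^0 + a^1 + ... + a^k.

def expected_tokens_per_pass(a: float, k: int) -> float:
    if a == 1.0:
        return k + 1.0
    return (1 - a ** (k + 1)) / (1 - a)
```

With the benchmark's 74% acceptance and lookahead 7 this gives about 3.5 tokens per pass; the measured 2.44x speedup is lower because each pass also pays the draft model's cost.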
## Related
- Local Inference — server setup
- Structured Decoding — guaranteed valid JSON output
- Device Profiling — benchmark data from real devices
- Observability — inference telemetry