# Speculative Decoding
Octomil automatically accelerates token generation by 1.5-3x on capable hardware. When you run `octomil serve`, speculative decoding activates if the device has enough memory -- no flags, no configuration, no draft model selection. You get faster output with identical quality.
## Quick Start

```bash
octomil serve phi-4-mini
```
That's it. If your device has 6 GB+ of RAM, a draft model is loaded and speculative decoding enables automatically (devices with 4-6 GB use a lighter pattern-based mode; see Device Requirements below). The startup output confirms it:

```text
[engine] Selected: mlx (fastest)
[speculative] Enabled: draft model loaded (452 MB)
[speculative] Method: draft-verify, lookahead: 7, adaptive: on
[serve] Listening on http://localhost:8080
```
## Before and After
Standard decoding generates one token per forward pass of the main model. Speculative decoding uses a small draft model to propose several tokens ahead, then verifies the whole batch with the main model in a single pass -- producing the same output faster.
| Mode | Throughput (phi-4-mini, M3 Pro) | Time for 200 tokens |
|---|---|---|
| Standard | 18 tok/s | 11.1s |
| Speculative (auto) | 44 tok/s | 4.5s |
The speedup varies by prompt. Predictable text (common phrases, structured output) sees higher gains; novel or highly technical content sees lower gains. Octomil adapts dynamically during generation.
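The draft-verify loop benchmarked above can be sketched in a few lines. This is an illustrative greedy-decoding sketch of the general technique, not Octomil's implementation; `main_model` and `draft_model` are hypothetical stand-ins that map a token sequence to the single next token.

```python
# Illustrative greedy draft-verify loop -- the general technique, not
# Octomil's implementation. `main_model` and `draft_model` are hypothetical
# stand-ins: each maps a token sequence to the next token.

def speculative_generate(main_model, draft_model, prompt, n_tokens, lookahead=7):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. The cheap draft model proposes `lookahead` tokens.
        draft = []
        for _ in range(lookahead):
            draft.append(draft_model(out + draft))
        # 2. The main model verifies the proposals (a single batched pass
        # in a real implementation).
        accepted = 0
        for i in range(lookahead):
            if main_model(out + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        # 3. The main model always contributes one token itself (the
        # correction on a mismatch), so progress is guaranteed even at
        # zero acceptance.
        out.append(main_model(out))
    return out[len(prompt):][:n_tokens]
```

Because every emitted token is either verified or generated by the main model, the output matches standard greedy decoding token for token; only the number of main-model passes changes.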
## How Speedup Is Reported
When telemetry is enabled, speculative metrics appear alongside standard inference metrics:
```bash
octomil serve phi-4-mini --api-key <your-api-key>
```
The /v1/chat/completions response includes speculative decoding headers:
```text
X-Octomil-Speculative: enabled
X-Octomil-Acceptance-Rate: 0.74
X-Octomil-Effective-Speedup: 2.3x
```
These metrics are also visible in the Monitoring Dashboard under the inference telemetry section. The dashboard displays acceptance rate trends over time, helping you understand real-world performance across your fleet.
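If you are scripting against the server, the headers above can be parsed into typed values. The header names come from this page; the helper itself is an illustrative sketch, not part of any Octomil SDK:

```python
# Parse Octomil's speculative-decoding headers into typed values.
# Header names come from this page; the helper is an illustrative sketch.

def speculative_stats(headers):
    """`headers` is any dict-like of HTTP response headers."""
    return {
        "enabled": headers.get("X-Octomil-Speculative") == "enabled",
        "acceptance_rate": float(headers.get("X-Octomil-Acceptance-Rate", "0")),
        # "2.3x" -> 2.3
        "speedup": float(headers.get("X-Octomil-Effective-Speedup", "1x").rstrip("x")),
    }
```

With the `requests` library this works directly on a response, e.g. `speculative_stats(requests.post(url, json=body).headers)`, since `Response.headers` is dict-like.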
## Device Requirements
Speculative decoding activates automatically based on available memory. No manual device checks needed.
| Available RAM | Behavior |
|---|---|
| 8 GB+ | Full speculative decoding with larger draft model. Highest acceptance rate and speedup (2.2-3x). |
| 6-8 GB | Speculative decoding with compact draft model. Good speedup (1.8-2.2x). |
| 4-6 GB | Lightweight acceleration using pattern-based prediction. No additional model loaded. Modest speedup (1.3-1.5x). |
| < 4 GB | Standard decoding. Speculative decoding disabled to preserve memory for the main model. |
Octomil evaluates available memory at startup (after loading the main model) and selects the strategy that fits. If memory pressure changes during inference, the runtime degrades gracefully.
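The tiering above amounts to a simple threshold check on post-load memory. The thresholds mirror the table; the function name and tier labels are hypothetical, not Octomil internals:

```python
# The memory tiers above as a threshold check. Function name and tier
# labels are hypothetical -- this mirrors the table, not Octomil's code.

def choose_strategy(available_gb: float) -> str:
    """Pick a decoding strategy from RAM available after the main model loads."""
    if available_gb >= 8:
        return "speculative-large-draft"    # 2.2-3x speedup
    if available_gb >= 6:
        return "speculative-compact-draft"  # 1.8-2.2x speedup
    if available_gb >= 4:
        return "pattern-based"              # 1.3-1.5x, no extra model loaded
    return "standard"                       # speculative decoding disabled
```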
## Disabling Speculative Decoding

If you need deterministic single-token generation (for debugging or benchmarking), disable it:

```bash
octomil serve phi-4-mini --speculative off
```
You can also disable it per-request by adding a header:
cURL:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Octomil-Speculative: off" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Python:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"X-Octomil-Speculative": "off"},
)
print(response.choices[0].message.content)
```
JavaScript:

```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Hello" }],
}, { headers: { "X-Octomil-Speculative": "off" } });
console.log(response.choices[0].message.content);
```
## Works with Structured Decoding
Speculative decoding and structured decoding work simultaneously. When you request JSON output via response_format, Octomil applies both optimizations -- schema-constrained token generation is still accelerated by speculative prediction.
cURL:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-mini",
    "messages": [{"role": "user", "content": "List 3 ML frameworks with descriptions."}],
    "response_format": {"type": "json_object"}
  }'
```
Python:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "List 3 ML frameworks with descriptions."}],
    response_format={"type": "json_object"},
)
# Both speculative decoding AND JSON enforcement are active
print(response.choices[0].message.content)
```
JavaScript:

```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "List 3 ML frameworks with descriptions." }],
  response_format: { type: "json_object" },
});
// Both speculative decoding AND JSON enforcement are active
console.log(response.choices[0].message.content);
```
## Benchmark Command
Run a standalone benchmark to see speculative decoding's impact on your hardware:
```bash
octomil benchmark phi-4-mini --speculative on
octomil benchmark phi-4-mini --speculative off
```
Example output:

```text
Model: phi-4-mini (3.8B)
Engine: mlx
Device: Apple M3 Pro (18 GB)
Speculative ON:  44.2 tok/s (acceptance rate: 74%)
Speculative OFF: 18.1 tok/s
Speedup: 2.44x
```
Share your results to improve recommendations for other users:
```bash
octomil benchmark phi-4-mini --share
```
This contributes anonymous performance data to the device profiling system.
## Quality Guarantee
Speculative decoding is lossless. It is not an approximation: the draft model only proposes candidates, and the main model accepts or rejects them with an exact accept/reject rule, so the main model's probability distribution is preserved exactly. Sampled output follows precisely the same distribution as standard decoding, and greedy output is token-for-token identical. There is no quality-speed tradeoff.
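This guarantee rests on the standard accept/reject rule used by draft-verify methods generally. A single-token sketch over a toy vocabulary (token indices 0..n-1) shows why the main model's distribution `p` is reproduced exactly even though candidates come from the draft distribution `q` -- this illustrates the published technique, not Octomil's specific code:

```python
# The standard speculative-sampling rule for one token: accept the draft's
# proposal x with probability min(1, p[x]/q[x]); on rejection, resample from
# the normalized residual max(0, p - q). The result is distributed as p.
import random

def speculative_sample(p, q, rng):
    """Draw one token distributed exactly as p, using draft distribution q."""
    x = rng.choices(range(len(q)), weights=q)[0]   # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):       # accept with prob min(1, p/q)
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    return rng.choices(range(len(p)), weights=[r / total for r in residual])[0]
```

Over many draws the output frequencies match `p`, not `q`; a better draft model only raises the acceptance rate (and therefore speed), never introduces error.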
## Gotchas
- Full speculative decoding requires 6 GB+ RAM -- devices with 4-6 GB fall back to lightweight pattern-based acceleration, and devices below 4 GB to standard decoding. No error either way; see Device Requirements.
- Acceptance rate varies by prompt -- creative/open-ended prompts see lower acceptance rates (~50-60%) than factual/structured prompts (~70-80%). Speedup scales with acceptance rate.
- The draft model is downloaded automatically -- the first run downloads a compact draft model (a few hundred MB; 452 MB for phi-4-mini). Subsequent runs use the cached copy.
- Not all engines support it -- currently MLX and llama.cpp only. Other engines fall back to standard decoding silently.
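The acceptance-rate gotcha above has a simple idealized model: with lookahead k and an independent per-token acceptance probability a, each main-model pass yields on average 1 + a + a² + ... + a^k tokens. This is a back-of-envelope sketch under an i.i.d. assumption, not Octomil's actual accounting:

```python
# Expected tokens per main-model pass under an idealized i.i.d. acceptance
# model: the always-emitted main-model token plus each successively
# accepted draft token, i.e. the geometric series a^0 + a^1 + ... + a^k.

def expected_tokens_per_pass(a: float, k: int) -> float:
    if a == 1.0:
        return k + 1.0
    return (1 - a ** (k + 1)) / (1 - a)
```

With the benchmark's 74% acceptance and lookahead 7 this gives about 3.5 tokens per pass; the measured 2.44x speedup is lower because each pass also pays the draft model's cost.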
## Related
- Local Inference — server setup
- Structured Decoding — guaranteed valid JSON output
- Device Profiling — benchmark data from real devices
- Observability — inference telemetry