Early Exit
Octomil can skip later transformer layers when a token's prediction is already confident. Easy tokens (common words, punctuation, predictable continuations) exit early and save compute. Hard tokens (reasoning, rare words, ambiguous context) use all layers. You get faster throughput without sacrificing quality where it matters.
Quick Start
```shell
octomil serve gemma-2b --early-exit-threshold 0.3
```
Startup output confirms early exit is active:
```
[engine] Selected: mlx (fastest)
[early-exit] Enabled: threshold=0.3, min_layers=50%
[serve] Listening on http://localhost:8080
```
Speed-Quality Presets
Instead of tuning the threshold manually, use a preset:
```shell
octomil serve gemma-2b --speed-quality balanced
```
| Preset | Threshold | Min Layers | Use Case |
|---|---|---|---|
| quality | 0.1 | 75% | Conservative — fewer early exits, minimal quality impact |
| balanced | 0.3 | 50% | Good tradeoff for most workloads |
| fast | 0.5 | 25% | Aggressive — more early exits, maximum speed |
The threshold controls how confident the model must be before exiting. Lower values mean the model needs higher confidence to exit early (fewer exits, higher quality). Higher values allow exits at lower confidence (more exits, faster).
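The preset-to-behavior mapping above can be sketched as a small decision rule. This is an illustration only, not Octomil's internal code; the `EarlyExitConfig` and `may_exit` names are hypothetical, while the threshold and min-layer values come from the preset table:

```python
# Sketch of the exit rule implied by the presets above.
# EarlyExitConfig and may_exit are illustrative names, not Octomil's API.
from dataclasses import dataclass

@dataclass
class EarlyExitConfig:
    threshold: float        # exit allowed when entropy falls below this
    min_layers_frac: float  # fraction of layers that must always run

PRESETS = {
    "quality":  EarlyExitConfig(threshold=0.1, min_layers_frac=0.75),
    "balanced": EarlyExitConfig(threshold=0.3, min_layers_frac=0.50),
    "fast":     EarlyExitConfig(threshold=0.5, min_layers_frac=0.25),
}

def may_exit(entropy: float, layer: int, total_layers: int,
             cfg: EarlyExitConfig) -> bool:
    """Exit only when the model is confident AND enough layers have run."""
    min_layers = int(total_layers * cfg.min_layers_frac)
    return entropy < cfg.threshold and layer >= min_layers

# With "balanced", a confident token (entropy 0.2) can exit at layer 16 of 32:
print(may_exit(0.2, 16, 32, PRESETS["balanced"]))  # True
# ...but not at layer 10, since min_layers is 16:
print(may_exit(0.2, 10, 32, PRESETS["balanced"]))  # False
```

Note that both conditions must hold: a very confident token still runs the minimum layer count, which is what keeps the `quality` preset conservative.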
How It Works
At each transformer layer, Octomil checks the entropy of the token prediction distribution. Low entropy means the model is confident. If entropy falls below the threshold and the minimum layer count is met, the remaining layers are skipped.
This is not an approximation. On tokens where the model is already confident at layer 16 of 32, layers 17-32 would not meaningfully change the output. Early exit skips the redundant computation.
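The confidence check described above can be illustrated with a short entropy computation. This is a sketch of the general idea, not Octomil's implementation; the `entropy` helper and the example distributions are hypothetical:

```python
import math

def entropy(probs):
    """Shannon entropy of a token probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident prediction: nearly all mass on one token -> low entropy.
confident = [0.97, 0.01, 0.01, 0.01]
# An uncertain prediction: mass spread evenly -> high entropy.
uncertain = [0.25, 0.25, 0.25, 0.25]

print(round(entropy(confident), 3))  # 0.168 -- below a 0.3 threshold
print(round(entropy(uncertain), 3))  # 1.386 -- ln(4), the maximum for 4 options
```

Under a 0.3 threshold, the first token would qualify for an early exit while the second would keep running layers.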
API Usage
Early exit is transparent to API callers. Use the standard chat completions endpoint:
cURL:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2b",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'
```
Python:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-2b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
```
JavaScript:

```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "gemma-2b",
  messages: [{ role: "user", content: "What is 2+2?" }],
});
console.log(response.choices[0].message.content);
```
Telemetry
When telemetry is enabled, early exit metrics appear as response headers:
```
X-Octomil-Early-Exit-Tokens: 42
X-Octomil-Avg-Layers-Used: 18.3
X-Octomil-Total-Layers: 32
```
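When reading these headers programmatically, the fraction of layer compute skipped can be derived from the average and total layer counts. A minimal sketch, using the documented header names; the `layers_saved_fraction` helper is hypothetical:

```python
def layers_saved_fraction(headers: dict) -> float:
    """Estimate the fraction of layer compute skipped, from Octomil's
    early-exit telemetry headers (names as documented above)."""
    avg_used = float(headers["X-Octomil-Avg-Layers-Used"])
    total = float(headers["X-Octomil-Total-Layers"])
    return 1.0 - (avg_used / total)

# Using the example header values above:
headers = {
    "X-Octomil-Early-Exit-Tokens": "42",
    "X-Octomil-Avg-Layers-Used": "18.3",
    "X-Octomil-Total-Layers": "32",
}
print(f"{layers_saved_fraction(headers):.1%}")  # 42.8%
```

In this example, 18.3 of 32 layers used on average means roughly 43% of layer compute was skipped for that request.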
These metrics are also visible in the Monitoring Dashboard under inference telemetry. The dashboard shows average layers used over time, helping you tune the threshold for your workload.
Configuration
| Flag | Default | Description |
|---|---|---|
| --early-exit-threshold | disabled | Entropy threshold (0.0-1.0). Lower = fewer exits, higher quality. |
| --speed-quality | none | Preset: quality, balanced, or fast. Overrides threshold. |
If both --early-exit-threshold and --speed-quality are provided, the preset takes precedence.
Works with Other Optimizations
Early exit combines with speculative decoding and structured decoding. All three can be active simultaneously. Early exit reduces per-token compute, speculative decoding reduces the number of forward passes, and structured decoding constrains the output format.
Gotchas
- Not all models support early exit — the model must have intermediate layer outputs exposed. Models served via GGUF/llama.cpp may not support this. MLX models work best.
- Threshold tuning is workload-dependent — a threshold that works well for factual Q&A may be too aggressive for creative writing. Benchmark with representative prompts before deploying.
- Early exit metrics are per-request averages — the X-Octomil-Avg-Layers-Used header is averaged across all tokens in that request. Individual tokens may use very different layer counts.
- Streaming and early exit — early exit works with streaming, but tokens with different exit points arrive at variable speeds. This can cause visible "bursts" in streaming output.
- Disabling changes output — while early exit preserves quality for confident tokens, disabling it may produce slightly different outputs for borderline tokens. This is expected.
Related
- Local Inference — server setup
- Speculative Decoding — automatic inference acceleration
- Prompt Compression — reduce context window usage
- Observability — monitor early exit metrics