Early Exit
Octomil can skip later transformer layers when a token's prediction is already confident. Easy tokens (common words, punctuation, predictable continuations) exit early and save compute. Hard tokens (reasoning, rare words, ambiguous context) use all layers. You get faster throughput without sacrificing quality where it matters.
Quick Start
```shell
octomil serve gemma-2b --early-exit-threshold 0.3
```
Startup output confirms early exit is active:
```
[engine] Selected: mlx (fastest)
[early-exit] Enabled: threshold=0.3, min_layers=50%
[serve] Listening on http://localhost:8080
```
Speed-Quality Presets
Instead of tuning the threshold manually, use a preset:
```shell
octomil serve gemma-2b --speed-quality balanced
```
| Preset | Threshold | Min Layers | Use Case |
|---|---|---|---|
| quality | 0.1 | 75% | Conservative — fewer early exits, minimal quality impact |
| balanced | 0.3 | 50% | Good tradeoff for most workloads |
| fast | 0.5 | 25% | Aggressive — more early exits, maximum speed |
The threshold controls how confident the model must be before exiting. Lower values mean the model needs higher confidence to exit early (fewer exits, higher quality). Higher values allow exits at lower confidence (more exits, faster).
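The preset-to-behavior mapping above can be sketched as a small decision rule. This is an illustration only, not Octomil's internal code; the `EarlyExitConfig` and `may_exit` names are hypothetical, while the threshold and min-layer values come from the preset table:

```python
# Sketch of the exit rule implied by the presets above.
# EarlyExitConfig and may_exit are illustrative names, not Octomil's API.
from dataclasses import dataclass

@dataclass
class EarlyExitConfig:
    threshold: float        # exit allowed when entropy falls below this
    min_layers_frac: float  # fraction of layers that must always run

PRESETS = {
    "quality":  EarlyExitConfig(threshold=0.1, min_layers_frac=0.75),
    "balanced": EarlyExitConfig(threshold=0.3, min_layers_frac=0.50),
    "fast":     EarlyExitConfig(threshold=0.5, min_layers_frac=0.25),
}

def may_exit(entropy: float, layer: int, total_layers: int,
             cfg: EarlyExitConfig) -> bool:
    """Exit only when the model is confident AND enough layers have run."""
    min_layers = int(total_layers * cfg.min_layers_frac)
    return entropy < cfg.threshold and layer >= min_layers

# With "balanced", a confident token (entropy 0.2) can exit at layer 16 of 32:
print(may_exit(0.2, 16, 32, PRESETS["balanced"]))  # True
# ...but not at layer 10, since min_layers is 16:
print(may_exit(0.2, 10, 32, PRESETS["balanced"]))  # False
```

Note that both conditions must hold: a very confident token still runs the minimum layer count, which is what keeps the `quality` preset conservative.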
How It Works
At each transformer layer, Octomil checks the entropy of the token prediction distribution. Low entropy means the model is confident. If entropy falls below the threshold and the minimum layer count is met, the remaining layers are skipped.
This is not an approximation. On tokens where the model is already confident at layer 16 of 32, layers 17-32 would not meaningfully change the output. Early exit skips the redundant computation.
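The confidence check described above can be illustrated with a short entropy computation. This is a sketch of the general idea, not Octomil's implementation; the `entropy` helper and the example distributions are hypothetical:

```python
import math

def entropy(probs):
    """Shannon entropy of a token probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident prediction: nearly all mass on one token -> low entropy.
confident = [0.97, 0.01, 0.01, 0.01]
# An uncertain prediction: mass spread evenly -> high entropy.
uncertain = [0.25, 0.25, 0.25, 0.25]

print(round(entropy(confident), 3))  # 0.168 -- below a 0.3 threshold
print(round(entropy(uncertain), 3))  # 1.386 -- ln(4), the maximum for 4 options
```

Under a 0.3 threshold, the first token would qualify for an early exit while the second would keep running layers.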
API Usage
Early exit is transparent to API callers. Use the standard chat completions endpoint:
cURL:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2b",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'
```
Python:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-2b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
```
JavaScript:

```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
  model: "gemma-2b",
  messages: [{ role: "user", content: "What is 2+2?" }],
});
console.log(response.choices[0].message.content);
```
Telemetry
When telemetry is enabled, early exit metrics appear as response headers:
```
X-Octomil-Early-Exit-Tokens: 42
X-Octomil-Avg-Layers-Used: 18.3
X-Octomil-Total-Layers: 32
```
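When reading these headers programmatically, the fraction of layer compute skipped can be derived from the average and total layer counts. A minimal sketch, using the documented header names; the `layers_saved_fraction` helper is hypothetical:

```python
def layers_saved_fraction(headers: dict) -> float:
    """Estimate the fraction of layer compute skipped, from Octomil's
    early-exit telemetry headers (names as documented above)."""
    avg_used = float(headers["X-Octomil-Avg-Layers-Used"])
    total = float(headers["X-Octomil-Total-Layers"])
    return 1.0 - (avg_used / total)

# Using the example header values above:
headers = {
    "X-Octomil-Early-Exit-Tokens": "42",
    "X-Octomil-Avg-Layers-Used": "18.3",
    "X-Octomil-Total-Layers": "32",
}
print(f"{layers_saved_fraction(headers):.1%}")  # 42.8%
```

In this example, 18.3 of 32 layers used on average means roughly 43% of layer compute was skipped for that request.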
These metrics are also visible in the Monitoring Dashboard under inference telemetry. The dashboard shows average layers used over time, helping you tune the threshold for your workload.
Configuration
| Flag | Default | Description |
|---|---|---|
| --early-exit-threshold | disabled | Entropy threshold (0.0-1.0). Lower = fewer exits, higher quality. |
| --speed-quality | none | Preset: quality, balanced, or fast. Overrides threshold. |
If both --early-exit-threshold and --speed-quality are provided, the preset takes precedence.
Works with Other Optimizations
Early exit combines with speculative decoding and structured decoding. All three can be active simultaneously. Early exit reduces per-token compute, speculative decoding reduces the number of forward passes, and structured decoding constrains the output format.
Gotchas
- Not all models support early exit — the model must have intermediate layer outputs exposed. Models served via GGUF/llama.cpp may not support this. MLX models work best.
- Threshold tuning is workload-dependent — a threshold that works well for factual Q&A may be too aggressive for creative writing. Benchmark with representative prompts before deploying.
- Early exit metrics are per-request averages — the X-Octomil-Avg-Layers-Used header is averaged across all tokens in that request. Individual tokens may use very different layer counts.
- Streaming and early exit — early exit works with streaming, but tokens with different exit points arrive at variable speeds. This can cause visible "bursts" in streaming output.
- Disabling changes output — while early exit preserves quality for confident tokens, disabling it may produce slightly different outputs for borderline tokens. This is expected.
Related
- Local Inference — server setup
- Speculative Decoding — automatic inference acceleration
- Prompt Compression — reduce context window usage
- Observability — monitor early exit metrics