Early Exit

Octomil can skip later transformer layers when a token's prediction is already confident. Easy tokens (common words, punctuation, predictable continuations) exit early and save compute. Hard tokens (reasoning, rare words, ambiguous context) use all layers. You get faster throughput without sacrificing quality where it matters.

Quick Start

octomil serve gemma-2b --early-exit-threshold 0.3

Startup output confirms early exit is active:

[engine] Selected: mlx (fastest)
[early-exit] Enabled: threshold=0.3, min_layers=50%
[serve] Listening on http://localhost:8080

Speed-Quality Presets

Instead of tuning the threshold manually, use a preset:

octomil serve gemma-2b --speed-quality balanced
Preset     Threshold  Min Layers  Use Case
quality    0.1        75%         Conservative — fewer early exits, minimal quality impact
balanced   0.3        50%         Good tradeoff for most workloads
fast       0.5        25%         Aggressive — more early exits, maximum speed

The threshold controls how confident the model must be before exiting. Lower values mean the model needs higher confidence to exit early (fewer exits, higher quality). Higher values allow exits at lower confidence (more exits, faster).
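The preset values from the table above can be expressed as a simple lookup, which is handy in a wrapper script that picks a preset per workload. The dict below is illustrative only; it restates the documented values, not Octomil's internal configuration format.

```python
# Documented preset values; names and structure are illustrative.
PRESETS = {
    "quality":  {"threshold": 0.1, "min_layers": 0.75},
    "balanced": {"threshold": 0.3, "min_layers": 0.50},
    "fast":     {"threshold": 0.5, "min_layers": 0.25},
}

def preset_flags(name):
    """Render a preset as the equivalent explicit threshold flag."""
    p = PRESETS[name]
    return f"--early-exit-threshold {p['threshold']}"
```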

How It Works

At each transformer layer, Octomil checks the entropy of the token prediction distribution. Low entropy means the model is confident. If entropy falls below the threshold and the minimum layer count is met, the remaining layers are skipped.

For confident tokens this is effectively lossless: on a token where the model is already confident at layer 16 of 32, layers 17-32 would not meaningfully change the output. Early exit skips that redundant computation.
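The per-layer decision described above can be sketched in a few lines. This is a minimal model of the mechanism, not Octomil's implementation: the function names are invented, and it assumes entropy is normalized to [0, 1] by dividing by the maximum possible entropy (log of the vocabulary size) so it is comparable to the 0.0-1.0 threshold.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_exit(probs, layer, total_layers, threshold=0.3, min_layer_frac=0.5):
    """Exit early if normalized entropy is below the threshold and the
    minimum fraction of layers has already run (hypothetical sketch)."""
    if layer < total_layers * min_layer_frac:
        return False  # minimum layer count not yet met
    max_entropy = math.log(len(probs))  # entropy of a uniform distribution
    return entropy(probs) / max_entropy < threshold
```

A peaked distribution (one token near probability 1) exits once past the minimum layer count; a near-uniform distribution never does.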

API Usage

Early exit is transparent to API callers. Use the standard chat completions endpoint:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-2b",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
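The same request in Python, using only the standard library. It assumes an Octomil server is running on localhost:8080 as started in Quick Start; the helper names here are illustrative, and no early-exit-specific fields are needed in the request body.

```python
import json
import urllib.request

def build_payload(prompt, model="gemma-2b"):
    """Standard chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, base="http://localhost:8080"):
    """POST to the OpenAI-compatible chat completions endpoint."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```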

Telemetry

When telemetry is enabled, early exit metrics appear as response headers:

X-Octomil-Early-Exit-Tokens: 42
X-Octomil-Avg-Layers-Used: 18.3
X-Octomil-Total-Layers: 32

These metrics are also visible in the Monitoring Dashboard under inference telemetry. The dashboard shows average layers used over time, helping you tune the threshold for your workload.
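One way to use these headers is to compute the average fraction of layer compute skipped per request. The sketch below assumes you already have the response headers as a dict (any HTTP client exposes them); the header names are the ones documented above.

```python
def layer_savings(headers):
    """Fraction of per-token layer compute skipped, on average,
    derived from the documented early-exit telemetry headers."""
    avg_used = float(headers["X-Octomil-Avg-Layers-Used"])
    total = float(headers["X-Octomil-Total-Layers"])
    return 1.0 - avg_used / total

# Using the example values above: 1 - 18.3/32 ≈ 0.43, i.e. ~43% of
# layer compute skipped on average for that request.
example = {
    "X-Octomil-Early-Exit-Tokens": "42",
    "X-Octomil-Avg-Layers-Used": "18.3",
    "X-Octomil-Total-Layers": "32",
}
```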

Configuration

Flag                     Default   Description
--early-exit-threshold   disabled  Entropy threshold (0.0-1.0). Lower = fewer exits, higher quality.
--speed-quality          none      Preset: quality, balanced, or fast. Overrides threshold.

If both --early-exit-threshold and --speed-quality are provided, the preset takes precedence.

Works with Other Optimizations

Early exit combines with speculative decoding and structured decoding. All three can be active simultaneously. Early exit reduces per-token compute, speculative decoding reduces the number of forward passes, and structured decoding constrains the output format.

Gotchas

  • Not all models support early exit — the model must have intermediate layer outputs exposed. Models served via GGUF/llama.cpp may not support this. MLX models work best.
  • Threshold tuning is workload-dependent — a threshold that works well for factual Q&A may be too aggressive for creative writing. Benchmark with representative prompts before deploying.
  • Early exit metrics are per-request averages — the X-Octomil-Avg-Layers-Used header is averaged across all tokens in that request. Individual tokens may use very different layer counts.
  • Streaming and early exit — early exit works with streaming, but tokens with different exit points arrive at variable speeds. This can cause visible "bursts" in streaming output.
  • Disabling changes output — while early exit preserves quality for confident tokens, disabling it may produce slightly different outputs for borderline tokens. This is expected.