Prompt Compression

Long system prompts and multi-turn conversations consume context window capacity and slow down prefill. Octomil compresses context automatically, removing low-value tokens or summarizing older conversation turns. The API is unchanged: compression happens before the prompt reaches the model.

Quick Start

octomil serve gemma-2b --compress-context

Startup output confirms compression is active:

[engine] Selected: mlx (fastest)
[compression] Enabled: strategy=token_pruning, ratio=0.5, min_tokens=256
[serve] Listening on http://localhost:8080

Token Pruning

Removes low-information tokens (articles, prepositions, filler words) from long prompts. Fast, requires no additional model.

octomil serve gemma-2b --compress-context --compression-strategy token_pruning --compression-ratio 0.5

The --compression-ratio flag controls how aggressively tokens are pruned: 0.5 removes roughly 50% of low-information tokens. Prompts shorter than --compression-threshold (default: 256 tokens) are not compressed.
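For example, a more conservative setup that removes fewer tokens and leaves anything under 512 tokens untouched (both flags are documented in the Configuration table below; the values here are illustrative):

octomil serve gemma-2b --compress-context \
  --compression-strategy token_pruning \
  --compression-ratio 0.3 \
  --compression-threshold 512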

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-2b",
"messages": [
{"role": "system", "content": "You are a helpful assistant specializing in machine learning. You have deep expertise in neural networks, transformers, and optimization algorithms. Always provide clear, concise explanations with practical examples."},
{"role": "user", "content": "Explain attention mechanisms."}
]
}'

The system prompt is compressed before it reaches the model. The response should be equivalent in meaning but arrives faster thanks to the reduced prefill.

Sliding Window

Keeps the most recent conversation turns verbatim and compresses older turns into a compact summary. Ideal for multi-turn conversations that grow beyond the context window.

octomil serve gemma-2b --compress-context --compression-strategy sliding_window --compression-max-turns 2

With --compression-max-turns 2, the 2 most recent conversation turns are kept in full. Older turns are condensed into a summary prefix. (The default is 4; see the Configuration table below.)

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-2b",
"messages": [
{"role": "user", "content": "What is federated learning?"},
{"role": "assistant", "content": "Federated learning is a technique where..."},
{"role": "user", "content": "How does it handle non-IID data?"},
{"role": "assistant", "content": "Non-IID data distributions are handled by..."},
{"role": "user", "content": "What about FedProx specifically?"},
{"role": "assistant", "content": "FedProx adds a proximal term..."},
{"role": "user", "content": "Compare FedProx to Scaffold."}
]
}'

In this example, the first 2 exchanges are compressed into a summary. The last 2 exchanges are passed verbatim to the model.
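Conceptually, the model might then receive something like the message list below. This is a hypothetical illustration only; the actual summary wording and message layout are produced by Octomil and will differ:

[
  {"role": "system", "content": "Summary of earlier turns: the user asked what federated learning is and how it handles non-IID data; the assistant explained both."},
  {"role": "user", "content": "What about FedProx specifically?"},
  {"role": "assistant", "content": "FedProx adds a proximal term..."},
  {"role": "user", "content": "Compare FedProx to Scaffold."}
]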

When to Use Which

Strategy       | Best For                                  | Requires Model | Typical Reduction
---------------|-------------------------------------------|----------------|-------------------------------------
Token Pruning  | Long system prompts, single-turn queries  | No             | ~50% fewer tokens
Sliding Window | Multi-turn conversations                  | No             | Keeps last K turns, compresses rest

Token pruning works best when you have a long, descriptive system prompt that contains redundant phrasing. It removes filler without changing meaning.
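As a rough illustration (hypothetical output; the actual pruner works token by token and will differ), pruning a descriptive instruction might drop articles, connectives, and filler like this:

Before: You are a helpful assistant specializing in machine learning. Always provide clear, concise explanations with practical examples.
After:  helpful assistant specializing machine learning. provide clear, concise explanations practical examples.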

Sliding window works best for chatbot-style applications where conversation history grows unbounded. Recent context stays sharp, older context is summarized.

Telemetry

Compression metrics appear as response headers:

X-Octomil-Compression: enabled
X-Octomil-Original-Tokens: 1842
X-Octomil-Compressed-Tokens: 923
X-Octomil-Compression-Ratio: 0.50
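
To inspect these headers from the command line, dump them with curl's -D flag and filter for the X-Octomil prefix. The system prompt below is a placeholder; it must exceed the compression threshold for the headers to show a reduction:

# Print only the response headers; the body is discarded
curl -s -D - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2b",
    "messages": [
      {"role": "system", "content": "<long system prompt, 256+ tokens>"},
      {"role": "user", "content": "Explain attention mechanisms."}
    ]
  }' | grep -i '^x-octomil'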

Configuration

Flag                    | Default       | Description
------------------------|---------------|-------------------------------------------------------
--compress-context      | off           | Enable prompt compression
--compression-strategy  | token_pruning | Strategy: token_pruning or sliding_window
--compression-ratio     | 0.5           | Target compression ratio for token pruning (0.0-1.0)
--compression-threshold | 256           | Minimum token count before compression activates
--compression-max-turns | 4             | Turns to keep verbatim (sliding window only)
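
For example, a chat deployment that keeps the last 6 turns verbatim (all flags are from the table above; the value 6 is illustrative):

octomil serve gemma-2b \
  --compress-context \
  --compression-strategy sliding_window \
  --compression-max-turns 6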

Gotchas

  • Token pruning can remove important context — pruning is heuristic. Domain-specific terms that look like filler may be removed. Test with representative prompts before deploying.
  • Sliding window summarization is lossy — older conversation turns are condensed, not preserved verbatim. If a user references something from 10 turns ago, the model may not recall the exact wording.
  • Compression ratio is approximate — --compression-ratio 0.5 targets a 50% reduction, but the actual result depends on prompt content. Some prompts compress more than others.
  • Short prompts are not compressed — prompts below --compression-threshold (default: 256 tokens) pass through unchanged. This prevents over-compression of concise queries (see the sanity check after this list).
  • Compression adds prefill latency — the compression step itself takes time. For very short prompts, enabling compression may be slower than not compressing. The break-even point is typically around 500+ tokens.
  • Not compatible with exact reproducibility — compressed prompts produce different model inputs than uncompressed ones. If you need bit-exact output matching, disable compression.
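
As a quick sanity check for the threshold behavior, send a deliberately short prompt and look at the compression headers. What the headers report for an uncompressed request is an assumption here; they may be absent or show no reduction:

# A two-word prompt is far below the default 256-token threshold,
# so it should pass through uncompressed
curl -s -D - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2b", "messages": [{"role": "user", "content": "Hello there"}]}' \
  | grep -i '^x-octomil'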