Prompt Compression
Long system prompts and multi-turn conversations consume context-window capacity and slow down prefill. Octomil compresses context automatically, removing low-value tokens or summarizing older conversation turns. The API is unchanged -- compression happens before the model sees the prompt.
Quick Start
octomil serve gemma-2b --compress-context
Startup output confirms compression is active:
[engine] Selected: mlx (fastest)
[compression] Enabled: strategy=token_pruning, ratio=0.5, min_tokens=256
[serve] Listening on http://localhost:8080
Token Pruning
Removes low-information tokens (articles, prepositions, filler words) from long prompts. It is fast and requires no additional model.
octomil serve gemma-2b --compress-context --compression-strategy token_pruning --compression-ratio 0.5
The --compression-ratio flag controls how aggressively tokens are pruned: 0.5 removes roughly 50% of low-information tokens. Prompts shorter than --compression-threshold (default: 256 tokens) are not compressed.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-2b",
"messages": [
{"role": "system", "content": "You are a helpful assistant specializing in machine learning. You have deep expertise in neural networks, transformers, and optimization algorithms. Always provide clear, concise explanations with practical examples."},
{"role": "user", "content": "Explain attention mechanisms."}
]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-2b",
messages=[
{"role": "system", "content": "You are a helpful assistant specializing in machine learning. You have deep expertise in neural networks, transformers, and optimization algorithms. Always provide clear, concise explanations with practical examples."},
{"role": "user", "content": "Explain attention mechanisms."},
],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "gemma-2b",
messages: [
{ role: "system", content: "You are a helpful assistant specializing in machine learning. You have deep expertise in neural networks, transformers, and optimization algorithms. Always provide clear, concise explanations with practical examples." },
{ role: "user", content: "Explain attention mechanisms." },
],
});
console.log(response.choices[0].message.content);
The system prompt is compressed before it reaches the model. The response carries the same meaning but arrives faster because prefill runs over fewer tokens.
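To build intuition for what token pruning does, here is a minimal, self-contained sketch of a stopword-based pruning pass in Python. The function name, stopword list, and heuristics are illustrative assumptions only -- Octomil's internal pruning is more sophisticated and is not exposed.

```python
# Illustrative sketch of a stopword-based token-pruning pass.
# NOT Octomil's implementation: the function name, stopword list,
# and pruning heuristic are assumptions for demonstration only.

STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and",
             "or", "that", "which", "with", "for", "is", "are"}

def prune_tokens(text, ratio=0.5, min_tokens=256):
    """Remove up to `ratio` of the tokens in `text`, targeting
    low-information stopwords first. Prompts shorter than
    `min_tokens` pass through unchanged, mirroring the
    --compression-threshold behaviour."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return text
    budget = int(len(tokens) * ratio)   # max tokens to remove
    kept = []
    for tok in tokens:
        if budget > 0 and tok.lower().strip(".,") in STOPWORDS:
            budget -= 1                 # drop a low-information token
        else:
            kept.append(tok)
    return " ".join(kept)
```

The `min_tokens` guard mirrors --compression-threshold: short prompts pass through untouched, so concise queries are never degraded.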
Sliding Window
Keeps the most recent conversation turns verbatim and compresses older turns into a compact summary. Ideal for multi-turn conversations that grow beyond the context window.
octomil serve gemma-2b --compress-context --compression-strategy sliding_window --compression-max-turns 4
With --compression-max-turns 4, the last 4 user/assistant exchanges are kept in full. Older turns are condensed into a summary prefix.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-2b",
"messages": [
{"role": "user", "content": "What is federated learning?"},
{"role": "assistant", "content": "Federated learning is a technique where..."},
{"role": "user", "content": "How does it handle non-IID data?"},
{"role": "assistant", "content": "Non-IID data distributions are handled by..."},
{"role": "user", "content": "What about FedProx specifically?"},
{"role": "assistant", "content": "FedProx adds a proximal term..."},
{"role": "user", "content": "Compare FedProx to Scaffold."}
]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-2b",
messages=[
{"role": "user", "content": "What is federated learning?"},
{"role": "assistant", "content": "Federated learning is a technique where..."},
{"role": "user", "content": "How does it handle non-IID data?"},
{"role": "assistant", "content": "Non-IID data distributions are handled by..."},
{"role": "user", "content": "What about FedProx specifically?"},
{"role": "assistant", "content": "FedProx adds a proximal term..."},
{"role": "user", "content": "Compare FedProx to Scaffold."},
],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "gemma-2b",
messages: [
{ role: "user", content: "What is federated learning?" },
{ role: "assistant", content: "Federated learning is a technique where..." },
{ role: "user", content: "How does it handle non-IID data?" },
{ role: "assistant", content: "Non-IID data distributions are handled by..." },
{ role: "user", content: "What about FedProx specifically?" },
{ role: "assistant", content: "FedProx adds a proximal term..." },
{ role: "user", content: "Compare FedProx to Scaffold." },
],
});
console.log(response.choices[0].message.content);
Once this conversation grows beyond 4 exchanges, the oldest turns are condensed into a single summary while the most recent 4 are passed to the model verbatim.
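The windowing logic can be sketched as follows. This is an illustrative Python approximation, not Octomil's implementation; in particular, the `summarize` stand-in merely concatenates message openings, whereas real summarization condenses content.

```python
# Illustrative sketch of sliding-window context compression.
# summarize() is a trivial stand-in; Octomil's real summarizer
# is internal and produces a genuine condensed summary.

def summarize(messages):
    # Stand-in: keep the first sentence of each older message.
    parts = [m["content"].split(".")[0] for m in messages]
    return "Summary of earlier conversation: " + " | ".join(parts)

def sliding_window(messages, max_turns=4):
    """Keep the last `max_turns` exchanges verbatim; fold older
    messages into one summary prefix. One turn is a user message
    plus the assistant reply (a trailing user message counts
    toward the window)."""
    window = max_turns * 2
    if len(messages) <= window:
        return messages  # conversation still fits: nothing to compress
    older, recent = messages[:-window], messages[-window:]
    prefix = {"role": "system", "content": summarize(older)}
    return [prefix] + recent
```

With `max_turns=4`, the seven-message conversation above passes through unchanged; once it exceeds eight messages, the oldest turns collapse into the summary prefix.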
When to Use Which
| Strategy | Best For | Requires Model | Typical Reduction |
|---|---|---|---|
| Token Pruning | Long system prompts, single-turn queries | No | ~50% fewer tokens |
| Sliding Window | Multi-turn conversations | No | Keeps last K turns, compresses rest |
Token pruning works best when you have a long, descriptive system prompt that contains redundant phrasing. It removes filler without changing meaning.
Sliding window works best for chatbot-style applications where conversation history grows unbounded. Recent context stays sharp, older context is summarized.
Telemetry
Compression metrics appear as response headers:
X-Octomil-Compression: enabled
X-Octomil-Original-Tokens: 1842
X-Octomil-Compressed-Tokens: 923
X-Octomil-Compression-Ratio: 0.50
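A client can read these headers to track savings. The snippet below hard-codes the sample header values from above so it runs standalone; in practice you would read them from your HTTP client's response object (for example `resp.headers` with requests or httpx).

```python
# Compute token savings from Octomil's compression headers.
# The header values are the sample values shown above, hard-coded
# so the snippet runs without a live server.
headers = {
    "X-Octomil-Compression": "enabled",
    "X-Octomil-Original-Tokens": "1842",
    "X-Octomil-Compressed-Tokens": "923",
    "X-Octomil-Compression-Ratio": "0.50",
}

if headers.get("X-Octomil-Compression") == "enabled":
    original = int(headers["X-Octomil-Original-Tokens"])
    compressed = int(headers["X-Octomil-Compressed-Tokens"])
    saved = original - compressed
    print(f"Pruned {saved} tokens; prompt is now "
          f"{compressed / original:.0%} of its original size.")
```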
Configuration
| Flag | Default | Description |
|---|---|---|
| --compress-context | off | Enable prompt compression |
| --compression-strategy | token_pruning | Strategy: token_pruning or sliding_window |
| --compression-ratio | 0.5 | Target compression ratio for token pruning (0.0-1.0) |
| --compression-threshold | 256 | Minimum token count before compression activates |
| --compression-max-turns | 4 | Turns kept verbatim (sliding window only) |
Gotchas
- Token pruning can remove important context — pruning is heuristic. Domain-specific terms that look like filler may be removed. Test with representative prompts before deploying.
- Sliding window summarization is lossy — older conversation turns are condensed, not preserved verbatim. If a user references something from 10 turns ago, the model may not recall the exact wording.
- Compression ratio is approximate — --compression-ratio 0.5 targets a 50% reduction, but the actual result depends on prompt content. Some prompts compress more than others.
- Short prompts are not compressed — prompts below --compression-threshold (default: 256 tokens) pass through unchanged. This prevents over-compression of concise queries.
- Compression adds prefill latency — the compression step itself takes time. For very short prompts, enabling compression may be slower than not compressing. The break-even point is typically around 500+ tokens.
- Not compatible with exact reproducibility — compressed prompts produce different model inputs than uncompressed ones. If you need bit-exact output matching, disable compression.
Related
- Local Inference — server setup
- Early Exit — skip unnecessary transformer layers
- Speculative Decoding — automatic inference acceleration
- Observability — monitor compression metrics