Prompt Compression
Long system prompts and multi-turn conversations consume context-window capacity and slow down prefill. Octomil compresses context automatically, removing low-value tokens or summarizing older conversation turns. The API is unchanged -- compression happens before the model sees the prompt.
Quick Start
octomil serve gemma-2b --compress-context
Startup output confirms compression is active:
[engine] Selected: mlx (fastest)
[compression] Enabled: strategy=token_pruning, ratio=0.5, min_tokens=256
[serve] Listening on http://localhost:8080
Token Pruning
Removes low-information tokens (articles, prepositions, filler words) from long prompts. It is fast and requires no additional model.
octomil serve gemma-2b --compress-context --compression-strategy token_pruning --compression-ratio 0.5
The --compression-ratio flag controls how aggressively tokens are pruned: 0.5 removes roughly 50% of low-information tokens. Prompts shorter than --compression-threshold (default: 256 tokens) are not compressed.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-2b",
"messages": [
{"role": "system", "content": "You are a helpful assistant specializing in machine learning. You have deep expertise in neural networks, transformers, and optimization algorithms. Always provide clear, concise explanations with practical examples."},
{"role": "user", "content": "Explain attention mechanisms."}
]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-2b",
messages=[
{"role": "system", "content": "You are a helpful assistant specializing in machine learning. You have deep expertise in neural networks, transformers, and optimization algorithms. Always provide clear, concise explanations with practical examples."},
{"role": "user", "content": "Explain attention mechanisms."},
],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "gemma-2b",
messages: [
{ role: "system", content: "You are a helpful assistant specializing in machine learning. You have deep expertise in neural networks, transformers, and optimization algorithms. Always provide clear, concise explanations with practical examples." },
{ role: "user", content: "Explain attention mechanisms." },
],
});
console.log(response.choices[0].message.content);
The system prompt is compressed before it reaches the model. The response carries the same meaning but arrives faster because prefill runs over fewer tokens.
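To build intuition for what token pruning does, here is a minimal, self-contained sketch of a stopword-based pruning pass in Python. The function name, stopword list, and heuristics are illustrative assumptions only -- Octomil's internal pruning is more sophisticated and is not exposed.

```python
# Illustrative sketch of a stopword-based token-pruning pass.
# NOT Octomil's implementation: the function name, stopword list,
# and pruning heuristic are assumptions for demonstration only.

STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and",
             "or", "that", "which", "with", "for", "is", "are"}

def prune_tokens(text, ratio=0.5, min_tokens=256):
    """Remove up to `ratio` of the tokens in `text`, targeting
    low-information stopwords first. Prompts shorter than
    `min_tokens` pass through unchanged, mirroring the
    --compression-threshold behaviour."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return text
    budget = int(len(tokens) * ratio)   # max tokens to remove
    kept = []
    for tok in tokens:
        if budget > 0 and tok.lower().strip(".,") in STOPWORDS:
            budget -= 1                 # drop a low-information token
        else:
            kept.append(tok)
    return " ".join(kept)
```

The `min_tokens` guard mirrors --compression-threshold: short prompts pass through untouched, so concise queries are never degraded.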
Sliding Window
Keeps the most recent conversation turns verbatim and compresses older turns into a compact summary. Ideal for multi-turn conversations that grow beyond the context window.
octomil serve gemma-2b --compress-context --compression-strategy sliding_window --compression-max-turns 4
With --compression-max-turns 4, the last 4 user/assistant exchanges are kept in full. Older turns are condensed into a summary prefix.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-2b",
"messages": [
{"role": "user", "content": "What is federated learning?"},
{"role": "assistant", "content": "Federated learning is a technique where..."},
{"role": "user", "content": "How does it handle non-IID data?"},
{"role": "assistant", "content": "Non-IID data distributions are handled by..."},
{"role": "user", "content": "What about FedProx specifically?"},
{"role": "assistant", "content": "FedProx adds a proximal term..."},
{"role": "user", "content": "Compare FedProx to Scaffold."}
]
}'
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-2b",
messages=[
{"role": "user", "content": "What is federated learning?"},
{"role": "assistant", "content": "Federated learning is a technique where..."},
{"role": "user", "content": "How does it handle non-IID data?"},
{"role": "assistant", "content": "Non-IID data distributions are handled by..."},
{"role": "user", "content": "What about FedProx specifically?"},
{"role": "assistant", "content": "FedProx adds a proximal term..."},
{"role": "user", "content": "Compare FedProx to Scaffold."},
],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "not-needed" });
const response = await client.chat.completions.create({
model: "gemma-2b",
messages: [
{ role: "user", content: "What is federated learning?" },
{ role: "assistant", content: "Federated learning is a technique where..." },
{ role: "user", content: "How does it handle non-IID data?" },
{ role: "assistant", content: "Non-IID data distributions are handled by..." },
{ role: "user", content: "What about FedProx specifically?" },
{ role: "assistant", content: "FedProx adds a proximal term..." },
{ role: "user", content: "Compare FedProx to Scaffold." },
],
});
console.log(response.choices[0].message.content);
Once this conversation grows beyond 4 exchanges, the oldest turns are condensed into a single summary while the most recent 4 are passed to the model verbatim.
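The windowing logic can be sketched as follows. This is an illustrative Python approximation, not Octomil's implementation; in particular, the `summarize` stand-in merely concatenates message openings, whereas real summarization condenses content.

```python
# Illustrative sketch of sliding-window context compression.
# summarize() is a trivial stand-in; Octomil's real summarizer
# is internal and produces a genuine condensed summary.

def summarize(messages):
    # Stand-in: keep the first sentence of each older message.
    parts = [m["content"].split(".")[0] for m in messages]
    return "Summary of earlier conversation: " + " | ".join(parts)

def sliding_window(messages, max_turns=4):
    """Keep the last `max_turns` exchanges verbatim; fold older
    messages into one summary prefix. One turn is a user message
    plus the assistant reply (a trailing user message counts
    toward the window)."""
    window = max_turns * 2
    if len(messages) <= window:
        return messages  # conversation still fits: nothing to compress
    older, recent = messages[:-window], messages[-window:]
    prefix = {"role": "system", "content": summarize(older)}
    return [prefix] + recent
```

With `max_turns=4`, the seven-message conversation above passes through unchanged; once it exceeds eight messages, the oldest turns collapse into the summary prefix.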
When to Use Which
| Strategy | Best For | Requires Model | Typical Reduction |
|---|---|---|---|
| Token Pruning | Long system prompts, single-turn queries | No | ~50% fewer tokens |
| Sliding Window | Multi-turn conversations | No | Keeps last K turns, compresses rest |
Token pruning works best when you have a long, descriptive system prompt that contains redundant phrasing. It removes filler without changing meaning.
Sliding window works best for chatbot-style applications where conversation history grows unbounded. Recent context stays sharp, older context is summarized.
Telemetry
Compression metrics appear as response headers:
X-Octomil-Compression: enabled
X-Octomil-Original-Tokens: 1842
X-Octomil-Compressed-Tokens: 923
X-Octomil-Compression-Ratio: 0.50
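A client can read these headers to track savings. The snippet below hard-codes the sample header values from above so it runs standalone; in practice you would read them from your HTTP client's response object (for example `resp.headers` with requests or httpx).

```python
# Compute token savings from Octomil's compression headers.
# The header values are the sample values shown above, hard-coded
# so the snippet runs without a live server.
headers = {
    "X-Octomil-Compression": "enabled",
    "X-Octomil-Original-Tokens": "1842",
    "X-Octomil-Compressed-Tokens": "923",
    "X-Octomil-Compression-Ratio": "0.50",
}

if headers.get("X-Octomil-Compression") == "enabled":
    original = int(headers["X-Octomil-Original-Tokens"])
    compressed = int(headers["X-Octomil-Compressed-Tokens"])
    saved = original - compressed
    print(f"Pruned {saved} tokens; prompt is now "
          f"{compressed / original:.0%} of its original size.")
```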
Configuration
| Flag | Default | Description |
|---|---|---|
| --compress-context | off | Enable prompt compression |
| --compression-strategy | token_pruning | Strategy: token_pruning or sliding_window |
| --compression-ratio | 0.5 | Target compression ratio for token pruning (0.0-1.0) |
| --compression-threshold | 256 | Minimum token count before compression activates |
| --compression-max-turns | 4 | Turns kept verbatim (sliding window only) |
Gotchas
- Token pruning can remove important context — pruning is heuristic. Domain-specific terms that look like filler may be removed. Test with representative prompts before deploying.
- Sliding window summarization is lossy — older conversation turns are condensed, not preserved verbatim. If a user references something from 10 turns ago, the model may not recall the exact wording.
- Compression ratio is approximate — --compression-ratio 0.5 targets a 50% reduction, but the actual result depends on prompt content. Some prompts compress more than others.
- Short prompts are not compressed — prompts below --compression-threshold (default: 256 tokens) pass through unchanged. This prevents over-compression of concise queries.
- Compression adds prefill latency — the compression step itself takes time. For very short prompts, enabling compression may be slower than not compressing. The break-even point is typically around 500+ tokens.
- Not compatible with exact reproducibility — compressed prompts produce different model inputs than uncompressed ones. If you need bit-exact output matching, disable compression.
Related
- Local Inference — server setup
- Early Exit — skip unnecessary transformer layers
- Speculative Decoding — automatic inference acceleration
- Observability — monitor compression metrics