Inference Optimization Design: KV Cache Compression, Speculative Decoding, and MoE On-Device
Design Principle: Every optimization works by default, with zero configuration, and produces correct results. Advanced users can tune. Nobody has to.
# This is what the developer writes. Nothing else.
model = client.load_model("my-llm")
stream = client.generate(model, prompt="Summarize this document:")
# All three optimizations activate automatically based on:
# - device hardware (RAM, NPU, core topology)
# - model architecture (dense vs MoE, attention type)
# - available memory at inference time
If the developer wants control, every optimization exposes a config object — but that second form is never required. The first form picks sane defaults for the device it's running on.
1. KV Cache Compression
Problem
KV cache grows linearly with sequence length. A 7B model at 2048 tokens with FP16 KV needs ~1GB of cache alone. On a phone with 6-8GB RAM (shared with the OS, apps, and the model weights), this is the binding constraint for context length.
How It Works
Token arrives
│
▼
┌──────────────────────────────────────────────────┐
│ KV Cache Manager │
│ │
│ 1. Write new K,V to cache │
│ 2. If cache < budget: done │
│ 3. If cache >= budget: │
│ a. Score every cache entry (attention weight) │
│ b. Evict lowest-scoring entries │
│ c. Compress retained entries to INT4/INT8 │
│ 4. If eviction insufficient: │
│ a. Spill cold entries to flash (UFS/NVMe) │
│ b. Prefetch on next attention if accessed │
└──────────────────────────────────────────────────┘
Default Behavior (Zero Config)
The runtime sizes the KV cache budget automatically:
available_memory = device_ram - os_reserve - model_weights - safety_margin
kv_budget = available_memory * 0.6 # 60% of remaining memory for KV
| Device | RAM | Model (7B Q4) | OS Reserve | KV Budget | Max Context (approx) |
|---|---|---|---|---|---|
| iPhone 15 Pro | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| iPhone 14 | 6 GB | 4 GB | 1.5 GB | ~300 MB | ~600 tokens |
| Pixel 8 | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| Galaxy S24 | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| Mid-range Android | 4 GB | — | — | — | Use 1-3B model |
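The sizing formula and the table above can be sketched as a small helper. The per-token KV size below assumes a 7B model with 32 layers, 32 KV heads of dimension 128, FP16 — illustrative assumptions, not fixed constants of the runtime:

```python
def kv_budget_bytes(device_ram, os_reserve, model_weights,
                    safety_margin=0.0, kv_fraction=0.6):
    """Portion of remaining RAM given to the KV cache (all args in bytes)."""
    available = device_ram - os_reserve - model_weights - safety_margin
    return max(0, available * kv_fraction)

def max_context_tokens(budget_bytes, layers=32, kv_heads=32,
                       head_dim=128, bytes_per_elem=2):
    """Tokens that fit: each token stores one K and one V vector per layer."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return int(budget_bytes // per_token)

GB = 1024 ** 3
budget = kv_budget_bytes(8 * GB, 2 * GB, 4 * GB)  # iPhone 15 Pro row
print(budget / GB)                  # 1.2
print(max_context_tokens(budget))   # ~2,400 at uncompressed FP16
```

Tiered compression (INT8/INT4 below) is what stretches the FP16 figure toward the table's ~2,500-token estimates.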
When the budget fills, the default eviction policy is MorphKV [1] — a fixed-size cache that iteratively refines which past tokens to keep based on observed attention patterns. No tuning required. It runs continuously.
Compression Tiers
The cache manager applies compression progressively, not all-or-nothing:
| Tier | Condition | Action | Quality Impact |
|---|---|---|---|
| Hot | Accessed in last 128 tokens | FP16, no compression | None |
| Warm | Accessed in last 512 tokens | INT8 quantized | Negligible (<0.1% perplexity) |
| Cold | Older than 512 tokens, low attention score | INT4 quantized | Minor (<0.5% perplexity) |
| Spilled | Evicted but potentially needed | Written to flash, loaded on demand | Latency spike on cache miss (~2-5ms) |
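A minimal sketch of the tiering decision, with the age thresholds taken from the table; the attention-score cutoff is an assumed illustrative value, not a spec'd constant:

```python
def kv_tier(tokens_since_access: int, attention_score: float,
            cold_score_cutoff: float = 0.01) -> str:
    """Map a KV cache entry to a compression tier per the table above."""
    if tokens_since_access <= 128:
        return "hot"    # FP16, untouched
    if tokens_since_access <= 512:
        return "warm"   # re-quantized to INT8
    if attention_score < cold_score_cutoff:
        return "cold"   # INT4, candidate for flash spill
    return "warm"       # old but still attended-to: keep at INT8

print(kv_tier(64, 0.3))     # hot
print(kv_tier(300, 0.3))    # warm
print(kv_tier(900, 0.001))  # cold
```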
Flash Spill (Disk Offloading)
For devices with fast storage (UFS 4.0 on flagship Android, NVMe on iPhone 15 Pro), cold KV entries spill to flash instead of being permanently evicted. This follows the KVSwap [2] approach:
- Entries are grouped into 64-token blocks to match flash I/O granularity
- A reuse buffer retains recently accessed blocks to avoid repeated I/O
- Sequential read pattern optimized for mobile storage controllers
- Disabled on devices with eMMC storage (too slow) — falls back to hard eviction
2. Speculative Decoding
Problem
Autoregressive generation is memory-bandwidth bound, not compute bound. The GPU/NPU sits mostly idle waiting for memory reads. A small "draft" model can speculatively generate N tokens cheaply, and the main model can verify all N in a single forward pass (same cost as generating 1 token).
How It Works
┌─────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Draft Model │ │ Verify (batch) │ │ Output │
│ (0.5B Q4) │────▶│ Main Model │────▶│ Accepted │
│ ~200MB │ │ (7B Q4) │ │ Tokens │
└─────────────┘ └──────────────────┘ └─────────────┘
│ │
│ Generate 5-7 │ Verify all at once
│ draft tokens │ Accept prefix that matches
│ sequentially │ Reject + resample from main
│ (fast, ~5ms each) │ model at divergence point
Key insight: Verification of N tokens costs the same as generating 1 token (single forward pass with the N draft tokens as input). If the draft model's acceptance rate is >50%, you get a net speedup.
Important: Speculative decoding with standard rejection sampling produces identical output distributions to normal decoding. It's not an approximation — it's a lossless speedup.
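For greedy decoding the accept/reject rule reduces to a longest-matching-prefix check, which makes the control flow easy to see. A toy sketch with stand-in models — the real runtime uses rejection sampling over full distributions to preserve the output distribution exactly, and batches the verification into one forward pass rather than the per-position calls emulated here:

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     main_argmax: Callable[[List[int]], int],
                     lookahead: int = 5) -> List[int]:
    """One draft-then-verify round; returns the newly accepted tokens."""
    # 1. Draft model proposes `lookahead` tokens sequentially (cheap).
    drafted = []
    for _ in range(lookahead):
        drafted.append(draft_next(prefix + drafted))
    # 2. Main model verifies every position; accept the matching prefix.
    accepted = []
    for i, tok in enumerate(drafted):
        target = main_argmax(prefix + drafted[:i])
        if tok == target:
            accepted.append(tok)      # still matching: accept for free
        else:
            accepted.append(target)   # diverged: take main model's token
            break
    return accepted

# Toy models: draft predicts x+1; main agrees except right after token 3.
draft = lambda seq: seq[-1] + 1
main = lambda seq: 100 if seq[-1] == 3 else seq[-1] + 1
print(speculative_step([1], draft, main))  # [2, 3, 100]
```

Note that even on a rejection the step still yields one correct token from the main model, so a round is never slower than plain decoding by more than the draft overhead.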
Draft Model Strategy
Octomil ships two universal draft models, fine-tuned for high acceptance rate across common architectures:
| Draft Model | Size (Q4) | Target | Expected Acceptance Rate | Speedup |
|---|---|---|---|---|
| octomil/draft-0.5b-q4 | ~200 MB | Phones with 6+ GB RAM | 60-70% | 1.8-2.2x |
| octomil/draft-1b-q4 | ~450 MB | Phones with 8+ GB RAM | 70-80% | 2.2-2.8x |
Adaptive Lookahead
Fixed lookahead is suboptimal — easy tokens (common phrases, punctuation) have higher acceptance rates than hard tokens (technical terms, reasoning steps). The runtime adjusts:
Running acceptance rate (last 32 tokens):
> 80% → increase lookahead by 1 (max 10)
< 50% → decrease lookahead by 1 (min 2)
else → hold steady
This is tracked per-generation, not persisted. Zero configuration.
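The adjustment rules above amount to a few lines of control logic; a sketch, with the window-of-32 acceptance rate assumed to be tracked elsewhere:

```python
def adjust_lookahead(k: int, acceptance_rate: float,
                     k_min: int = 2, k_max: int = 10) -> int:
    """Adapt speculation depth from the running acceptance rate
    (measured over the last 32 tokens, per the rules above)."""
    if acceptance_rate > 0.80:
        return min(k + 1, k_max)   # easy text: speculate deeper
    if acceptance_rate < 0.50:
        return max(k - 1, k_min)   # hard text: back off
    return k

print(adjust_lookahead(5, 0.90))  # 6
print(adjust_lookahead(2, 0.30))  # 2  (clamped at k_min)
print(adjust_lookahead(5, 0.65))  # 5
```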
Lookahead Decoding (Draft-Free Alternative)
For devices too constrained for a draft model, lookahead decoding uses the main model's own n-gram patterns to speculate without any additional model:
Jacobi iteration: generate multiple future token positions in parallel
└─ n-gram cache: if the model has produced "the United" before,
speculate that "States" follows without running a forward pass
This gives a smaller speedup (1.3-1.5x) but requires zero additional memory. It's the fallback when the draft model doesn't fit.
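A toy sketch of the n-gram side of lookahead decoding (the Jacobi iteration itself is omitted; class and method names are illustrative):

```python
from typing import Dict, List, Optional, Tuple

class NGramCache:
    """Remember (n-1)-token contexts the model has already produced,
    and speculate the token that followed them last time."""
    def __init__(self, n: int = 3):
        self.n = n
        self.table: Dict[Tuple[int, ...], int] = {}

    def observe(self, tokens: List[int]) -> None:
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.table[ctx] = tokens[i + self.n - 1]

    def speculate(self, tokens: List[int]) -> Optional[int]:
        ctx = tuple(tokens[-(self.n - 1):])
        return self.table.get(ctx)  # None -> no free guess, run the model

cache = NGramCache(n=3)
cache.observe([7, 8, 9, 7, 8])     # think: "the United States ... the United"
print(cache.speculate([5, 7, 8]))  # 9: saw "7 8 -> 9" before
```

Speculated tokens are still verified by the main model, so a wrong guess costs nothing beyond the verification slot it occupied.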
3. Mixture-of-Experts (MoE) On-Device
Problem
Dense models activate every parameter for every token. A 7B dense model performs roughly 7B multiply-accumulates per token. MoE architectures (DeepSeek-V3, Mixtral, DBRX) have many more total parameters but only activate a subset ("experts") per token — e.g., Mixtral 8x7B has ~47B total params but only ~13B active per token.
The challenge on-device: all experts must be in memory OR loaded from flash on demand. With 64+ experts, they can't all fit in RAM simultaneously.
How It Works
Token arrives
│
▼
┌───────────────┐
│ Router/Gate │ Scores all experts, selects top-K (typically K=2)
└───────┬───────┘
│
▼
┌────────────────────────────────────────────────────┐
│ Expert Cache Manager │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Expert 3 │ │Expert 7 │ │Expert 12│ │Expert 21│ │ ← Resident in RAM
│ │ (hot) │ │ (hot) │ │ (warm) │ │ (warm) │ │ (max_resident = 4-8)
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Remaining experts on flash │
│ │
│ On cache miss: │
│ 1. Predict next-needed expert from gating input │
│ 2. Begin async prefetch from flash │
│ 3. If not prefetched in time: load synchronously │
│ 4. Evict least-recently-used resident expert │
└────────────────────────────────────────────────────┘
Expert Prefetching
This is the key to making MoE feel fast on-device. Naive expert loading on a cache miss adds 5-20ms latency (UFS 4.0 sequential read). Prefetching hides this behind computation:
Gating-based prediction: The gating network's input for layer N is available before layer N's experts are needed. We run the gate for layer N+1 during layer N's forward pass, then start prefetching N+1's experts asynchronously.
Layer N forward pass (compute-bound, ~10ms)
│
├── Meanwhile: run layer N+1 gate on current hidden state
│ └── Predict top-K experts for layer N+1
│ └── Begin async flash read for any not resident
│
▼
Layer N+1 forward pass
└── Experts already in memory (90%+ hit rate with prediction)
Research shows ~90% prediction accuracy for next-layer expert selection. On a 10% miss rate with UFS 4.0, the average latency impact is <1ms per layer.
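The overlap above can be sketched with a thread pool standing in for async flash I/O; `run_layer`, `predict_experts`, and `load_expert` are illustrative stubs, and the sleeps stand in for the ~10ms compute and ~5ms UFS read budgets mentioned above:

```python
import concurrent.futures as cf
import time

def run_layer(n: int) -> None:
    time.sleep(0.010)       # stand-in for the ~10ms forward pass

def predict_experts(layer: int) -> list:
    return [3, 7]           # stand-in for running layer's gate early

def load_expert(layer: int, e: int) -> None:
    time.sleep(0.005)       # stand-in for a ~5ms flash read

resident: set = set()
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    for layer in range(3):
        # Run layer N+1's gate now, then start its flash reads async.
        wanted = [(layer + 1, e) for e in predict_experts(layer + 1)]
        futures = {key: pool.submit(load_expert, *key)
                   for key in wanted if key not in resident}
        run_layer(layer)                 # compute overlaps with the reads
        for key, f in futures.items():   # usually already finished here
            f.result()
            resident.add(key)
```

Because each read (~5ms) finishes well inside the layer's compute (~10ms), the synchronous `f.result()` wait is normally zero — that is the entire trick.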
Mixed Precision Expert Loading
Following the HOBBIT [3] approach, experts don't all need the same precision:
| Expert State | Precision | When |
|---|---|---|
| Resident (hot) | INT4 | Frequently activated, in RAM |
| Loading (miss) | INT2 | Cache miss — load fast at lower precision |
| Promoted | INT4 | If INT2 expert is reused, upgrade async |
When a cache miss occurs, the expert loads at INT2 (half the I/O time), serves the current token, and is asynchronously upgraded to INT4 for subsequent uses. Quality impact of INT2 on a single expert for a single token is negligible.
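The miss-then-promote lifecycle is a tiny state machine; a sketch (the real upgrade happens asynchronously, which this synchronous toy elides):

```python
class ExpertSlot:
    """Track one expert's precision through the miss -> promote lifecycle."""
    def __init__(self):
        self.precision = None         # not resident

    def on_miss(self) -> str:
        self.precision = "int2"       # half the I/O: serve this token now
        return self.precision

    def on_reuse(self) -> str:
        if self.precision == "int2":
            self.precision = "int4"   # upgraded (async in the real runtime)
        return self.precision

slot = ExpertSlot()
print(slot.on_miss())    # int2
print(slot.on_reuse())   # int4
```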
What This Unlocks
MoE is what makes "bigger-than-device" models feasible:
| Model | Total Params | Active Params/Token | Size on Flash (Q4) | RAM Needed (4 resident experts) |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | ~24 GB | ~5 GB |
| DeepSeek-V3 (mobile variant) | 16B (distilled) | 2B | ~8 GB | ~2 GB |
| Octomil MoE-3B (custom) | 12B | 3B | ~6 GB | ~2 GB |
MoE Model Support Strategy
| Phase | MoE Source | Client Effort | GPU Cost |
|---|---|---|---|
| Phase 1 | Native MoE from HuggingFace (Mixtral, DBRX, etc.) | Zero — auto-detected | None |
| Phase 2 | Octomil catalog (upcycled versions of popular dense models) | Pick from catalog | Ours (one-time per model) |
| Phase 3 | Client uploads dense model, we upcycle | Upload model, wait ~24h | Ours (billed to client) |
MoE + Federated Learning
When FL trains an MoE model across devices, FedAvg aggregates all parameter groups — experts, gate, attention — the same way: weighted averaging. Devices only upload deltas for experts they activated during local training (not all experts), reducing bandwidth proportionally.
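Per-expert aggregation is ordinary sample-weighted FedAvg, just computed over only the devices that uploaded each expert. A sketch with scalar deltas standing in for weight tensors (names illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def aggregate_expert_deltas(
    updates: List[Tuple[int, Dict[int, float]]]
) -> Dict[int, float]:
    """updates: (num_local_samples, {expert_id: delta}) per device.
    Each expert is averaged over only the devices that activated it."""
    sums: Dict[int, float] = defaultdict(float)
    weights: Dict[int, float] = defaultdict(float)
    for n, deltas in updates:
        for eid, delta in deltas.items():
            sums[eid] += n * delta
            weights[eid] += n
    return {eid: sums[eid] / weights[eid] for eid in sums}

# Device A (100 samples) activated experts 0 and 3; device B (300) only 3.
print(aggregate_expert_deltas([
    (100, {0: 1.0, 3: 2.0}),
    (300, {3: 4.0}),
]))  # {0: 1.0, 3: 3.5}
```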
4. How the Three Interact
These optimizations are not independent — they share memory and interact at runtime. The memory manager arbitrates:
Device RAM Budget
│
├── OS + App Reserve (fixed, ~2 GB)
│
├── Model Weights
│ ├── Non-expert layers (always resident)
│ ├── Expert cache (MoE only — dynamic, shared pool)
│ └── Draft model (speculative decoding — fixed)
│
├── KV Cache (dynamic, grows with context)
│ ├── Hot tier (FP16)
│ ├── Warm tier (INT8)
│ └── Cold tier (INT4, may spill to flash)
│
└── Safety margin (~256 MB)
Memory Arbitration
The runtime uses a unified memory pool for KV cache and expert cache. When KV cache is nearly full and experts need loading, the arbitrator can:
- Compress KV entries from warm to cold tier (free ~30% space)
- Spill cold KV to flash (free the cold tier entirely)
- Evict least-used experts to make room for KV
This means longer conversations gracefully trade expert cache space for KV space — the model gets slightly slower (more expert cache misses) but doesn't OOM or truncate context.
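The arbitration order above can be sketched as a fixed fallback chain; pool names, sizes, and the exact 30% figure are illustrative:

```python
from typing import Dict

def free_kv_space(needed_mb: float, state: Dict[str, float]) -> float:
    """Apply the arbitration steps in order until `needed_mb` is freed.
    `state` maps each reclaimable pool to its current size in MB."""
    steps = [
        ("warm_kv", 0.3),   # warm -> cold recompression frees ~30%
        ("cold_kv", 1.0),   # spill the whole cold tier to flash
        ("experts", 1.0),   # finally, evict least-used resident experts
    ]
    freed = 0.0
    for pool, fraction in steps:
        if freed >= needed_mb:
            break
        reclaim = state[pool] * fraction
        state[pool] -= reclaim
        freed += reclaim
    return freed

state = {"warm_kv": 400.0, "cold_kv": 200.0, "experts": 600.0}
print(free_kv_space(250, state))  # ~320: warm recompress + full cold spill
print(state["experts"])           # 600.0 — experts untouched this time
```

Note the ordering encodes the design intent: KV compression and spill are cheaper to undo than expert eviction, so experts are sacrificed last.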
Speculative Decoding + KV Cache
Draft model tokens write to a tentative KV cache region. On rejection, the tentative entries are discarded (no wasted cache space). On acceptance, they're promoted to the main cache. Only accepted tokens consume cache budget.
Speculative Decoding + MoE
Speculative decoding amortizes expert loading cost. Instead of loading experts for 1 token, the draft generates N tokens, and the main model's verification pass processes all N at once. If multiple draft tokens activate the same expert, that expert is loaded once and used N times.
5. Device Decision Matrix
| Device | KV Compression | Flash Spill | Speculative (Draft) | Speculative (Lookahead) | MoE Expert Offload |
|---|---|---|---|---|---|
| iPhone 15 Pro (8 GB) | Yes | Yes (NVMe) | Yes (0.5B or 1B draft) | Fallback | Yes |
| iPhone 14 (6 GB) | Yes | Yes (NVMe) | Yes (0.5B draft only) | Fallback | Marginal |
| Pixel 8 (8 GB) | Yes | Yes (UFS 4.0) | Yes (0.5B or 1B draft) | Fallback | Yes |
| Galaxy S24 (8 GB) | Yes | Yes (UFS 4.0) | Yes (0.5B or 1B draft) | Fallback | Yes |
| Mid-range Android (4 GB) | Yes | No (eMMC) | No (insufficient RAM) | Yes | No |
6. SDK Surface
Python (Server-Side Optimization)
from octomil import OctomilClient
client = OctomilClient(api_key="...")
# Deploy a model — optimization happens server-side, config bundled with model
client.deploy_model(
    model_id="my-llm",
    model_path="./qwen-7b",
    target_devices=["iphone_15_pro", "pixel_8"],
    # Optimization is automatic. Override only if needed:
    # inference_config=InferenceConfig(...)
)
Swift (iOS Runtime)
// Load model — all optimizations activate automatically
let model = try await client.loadModel(modelId: "my-llm")
// Generate — speculative decoding, KV compression, MoE all transparent
let stream = client.generateStream(model: model, input: prompt)
for try await chunk in stream {
    print(chunk.text, terminator: "")
}
// Inspect what the runtime chose (observability, not required)
let stats = model.inferenceStats
print(stats.kvCacheUtilization) // 0.73
print(stats.speculativeAcceptanceRate) // 0.68
print(stats.expertCacheHitRate) // 0.91
Kotlin (Android Runtime)
val model = client.loadModel("my-llm")
val stream = client.generateStream(model, prompt)
stream.collect { chunk ->
    print(chunk.text)
}
val stats = model.inferenceStats
Log.d("Octomil", "KV: ${stats.kvCacheUtilization}, Spec: ${stats.speculativeAcceptanceRate}")
7. Implementation Phases
Phase 1: KV Cache Compression (Highest Impact, Lowest Risk)
MorphKV-style fixed-budget cache with tiered quantization. Auto-sizes from device profile. Validation target on iPhone 15 Pro: max context extends from ~1K tokens to ~2.5K tokens at <0.5% perplexity increase.
Phase 2: Speculative Decoding (Highest Speedup, Medium Complexity)
Draft-verify with octomil/draft-0.5b-q4, adaptive lookahead, lookahead fallback. Target: 2x speedup over baseline with identical output distribution.
Phase 3a: MoE Expert Offloading — Native MoE Models
Expert cache with LRU eviction, gating-based prefetch, mixed-precision loading. Supports any HuggingFace MoE model. Target: Mixtral 8x7B fits in 5GB RAM + 24GB flash.
Phase 3b: MoE Catalog — Sparse Upcycled Dense Models
Pre-built MoE variants of popular dense models via sparse upcycling. Client picks from catalog, zero training required.
Research References
Octomil is building the developer platform for federated learning and on-device AI. Visit octomil.com to learn more.