
Inference Optimization Design: KV Cache Compression, Speculative Decoding, and MoE On-Device

· 11 min read

Design Principle: Every optimization works by default, with zero configuration, and produces correct results. Advanced users can tune. Nobody has to.

# This is what the developer writes. Nothing else.
model = client.load_model("my-llm")
stream = client.generate(model, prompt="Summarize this document:")

# All three optimizations activate automatically based on:
# - device hardware (RAM, NPU, core topology)
# - model architecture (dense vs MoE, attention type)
# - available memory at inference time

If the developer wants control, every optimization exposes a config object — but that second form is never required. The first form picks sane defaults for the device it's running on.


1. KV Cache Compression

Problem

KV cache grows linearly with sequence length. A 7B model at 2048 tokens with FP16 KV needs ~1GB of cache alone. On a phone with 6-8GB RAM (shared with the OS, apps, and the model weights), this is the binding constraint for context length.

How It Works

Token arrives
      │
      ▼
┌──────────────────────────────────────────────────┐
│ KV Cache Manager                                 │
│                                                  │
│ 1. Write new K,V to cache                        │
│ 2. If cache < budget: done                       │
│ 3. If cache >= budget:                           │
│    a. Score every cache entry (attention weight) │
│    b. Evict lowest-scoring entries               │
│    c. Compress retained entries to INT4/INT8     │
│ 4. If eviction insufficient:                     │
│    a. Spill cold entries to flash (UFS/NVMe)     │
│    b. Prefetch on next attention if accessed     │
└──────────────────────────────────────────────────┘

Default Behavior (Zero Config)

The runtime sizes the KV cache budget automatically:

available_memory = device_ram - os_reserve - model_weights - safety_margin
kv_budget = available_memory * 0.6 # 60% of remaining memory for KV
| Device | RAM | Model (7B Q4) | OS Reserve | KV Budget | Max Context (approx) |
|---|---|---|---|---|---|
| iPhone 15 Pro | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| iPhone 14 | 6 GB | 4 GB | 1.5 GB | ~300 MB | ~600 tokens |
| Pixel 8 | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| Galaxy S24 | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| Mid-range Android | 4 GB | n/a | n/a | n/a | Use 1-3B model |
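
The sizing rule above can be sketched as a pair of small functions (a sketch: the per-token KV size comes from the ~1 GB per 2048 tokens figure cited earlier, and the table's budgets are rounded approximations):

```python
GIB = 2 ** 30
# ~1 GB of FP16 KV per 2048 tokens for a 7B model (figure cited above)
KV_BYTES_PER_TOKEN = GIB / 2048

def kv_budget_bytes(ram_gb, model_gb, os_reserve_gb, safety_gb=0.25):
    """available = ram - os_reserve - weights - safety; KV gets 60% of it."""
    available = max(ram_gb - os_reserve_gb - model_gb - safety_gb, 0.0)
    return available * 0.6 * GIB

def max_context_tokens(ram_gb, model_gb=4.0, os_reserve_gb=2.0):
    """Uncompressed FP16 context that fits in the auto-sized budget."""
    return int(kv_budget_bytes(ram_gb, model_gb, os_reserve_gb) // KV_BYTES_PER_TOKEN)
```

An 8 GB flagship with a 4 GB model lands around a 1 GB budget and roughly 2,000 FP16 tokens; the table's ~2,500-token figures assume the tiered compression described below.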

When the budget fills, the default eviction policy is MorphKV¹ — a fixed-size cache that iteratively refines which past tokens to keep based on observed attention patterns. No tuning required. It runs continuously.
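
In sketch form, a fixed-budget cache with attention-scored eviction looks like this (illustrative data structures; the moving-average scoring is a stand-in for MorphKV's iterative refinement):

```python
class FixedBudgetKVCache:
    """Hold at most `budget` KV entries; evict the lowest-scored first."""
    def __init__(self, budget):
        self.budget = budget
        self.entries = {}   # position -> (key, value) tensors
        self.scores = {}    # position -> running attention score

    def observe(self, position, attn_weight, decay=0.9):
        # Update the entry's score from attention observed this step
        # (an EMA stand-in for MorphKV's iterative selection).
        old = self.scores.get(position, 0.0)
        self.scores[position] = decay * old + (1 - decay) * attn_weight

    def append(self, position, kv):
        self.entries[position] = kv
        self.scores.setdefault(position, 0.0)
        while len(self.entries) > self.budget:
            victim = min(self.entries, key=lambda p: self.scores[p])
            del self.entries[victim]
            del self.scores[victim]
```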

Compression Tiers

The cache manager applies compression progressively, not all-or-nothing:

| Tier | Condition | Action | Quality Impact |
|---|---|---|---|
| Hot | Accessed in last 128 tokens | FP16, no compression | None |
| Warm | Accessed in last 512 tokens | INT8 quantized | Negligible (<0.1% perplexity) |
| Cold | Older than 512 tokens, low attention score | INT4 quantized | Minor (<0.5% perplexity) |
| Spilled | Evicted but potentially needed | Written to flash, loaded on demand | Latency spike on cache miss (~2-5ms) |
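
Read as code, the tier assignment is a small classifier (a sketch; the attention-score cutoff is an illustrative parameter the table leaves unspecified):

```python
def kv_tier(age_tokens, attn_score, spilled=False, score_floor=0.01):
    """Map a KV cache entry to a compression tier per the table above."""
    if spilled:
        return "spilled"   # on flash, loaded on demand
    if age_tokens <= 128:
        return "hot"       # FP16, no compression
    if age_tokens <= 512:
        return "warm"      # INT8
    if attn_score < score_floor:
        return "cold"      # INT4
    return "warm"          # old but still attended: keep at INT8
```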

Flash Spill (Disk Offloading)

For devices with fast storage (UFS 4.0 on flagship Android, NVMe on iPhone 15 Pro), cold KV entries spill to flash instead of being permanently evicted. This follows the KVSwap² approach:

  • Entries are grouped into 64-token blocks to match flash I/O granularity
  • A reuse buffer retains recently accessed blocks to avoid repeated I/O
  • Sequential read pattern optimized for mobile storage controllers
  • Disabled on devices with eMMC storage (too slow) — falls back to hard eviction

2. Speculative Decoding

Problem

Autoregressive generation is memory-bandwidth bound, not compute bound. The GPU/NPU sits mostly idle waiting for memory reads. A small "draft" model can speculatively generate N tokens cheaply, and the main model can verify all N in a single forward pass (same cost as generating 1 token).

How It Works

┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│ Draft Model │     │  Verify (batch)  │     │   Output    │
│  (0.5B Q4)  │────▶│    Main Model    │────▶│  Accepted   │
│   ~200MB    │     │     (7B Q4)      │     │   Tokens    │
└─────────────┘     └──────────────────┘     └─────────────┘
       │                      │
       │ Generate 5-7         │ Verify all at once
       │ draft tokens         │ Accept prefix that matches
       │ sequentially         │ Reject + resample from main
       │ (fast, ~5ms each)    │ model at divergence point
Key insight: Verification of N tokens costs the same as generating 1 token (single forward pass with the N draft tokens as input). If the draft model's acceptance rate is >50%, you get a net speedup.

Important: Speculative decoding with standard rejection sampling produces identical output distributions to normal decoding. It's not an approximation — it's a lossless speedup.
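
One draft-verify round can be sketched under greedy decoding (hypothetical callables; the greedy variant shown reproduces greedy decoding exactly, while the production path uses rejection sampling over full distributions to preserve the sampled distribution):

```python
def speculative_step(draft_next, main_greedy, context, lookahead=5):
    """One draft-verify round under greedy decoding.
    draft_next(tokens) -> the draft model's next token (called sequentially).
    main_greedy(tokens, n_preds) -> the main model's greedy token after each
        of the last n_preds prefixes of `tokens`, from ONE batched forward pass.
    Returns the tokens accepted this round (always at least one)."""
    draft, ctx = [], list(context)
    for _ in range(lookahead):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Single verification pass: prediction after context, after
    # context+draft[:1], ..., after context+draft (lookahead+1 predictions).
    preds = main_greedy(ctx, lookahead + 1)
    accepted = []
    for d, p in zip(draft, preds):
        if d != p:
            accepted.append(p)   # diverged: take the main model's token
            return accepted
        accepted.append(d)       # matched: accept the draft token
    accepted.append(preds[-1])   # all matched: free bonus token from main
    return accepted
```

A perfect draft yields lookahead + 1 tokens per main-model pass; a useless draft still yields one, so the step never regresses below baseline decoding.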

Draft Model Strategy

Octomil ships two universal draft models, fine-tuned for high acceptance rate across common architectures:

Draft ModelSize (Q4)TargetExpected Acceptance RateSpeedup
octomil/draft-0.5b-q4~200 MBPhones with 6+ GB RAM60-70%1.8-2.2x
octomil/draft-1b-q4~450 MBPhones with 8+ GB RAM70-80%2.2-2.8x

Adaptive Lookahead

Fixed lookahead is suboptimal — easy tokens (common phrases, punctuation) have higher acceptance rates than hard tokens (technical terms, reasoning steps). The runtime adjusts:

Running acceptance rate (last 32 tokens):
  > 80% → increase lookahead by 1 (max 10)
  < 50% → decrease lookahead by 1 (min 2)
  else  → hold steady

This is tracked per-generation, not persisted. Zero configuration.
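
That controller is a few lines of state (a sketch using the window and thresholds above):

```python
from collections import deque

class AdaptiveLookahead:
    """Tune speculative lookahead from the recent acceptance rate."""
    def __init__(self, initial=5, lo=2, hi=10, window=32):
        self.k, self.lo, self.hi = initial, lo, hi
        self.history = deque(maxlen=window)   # 1 = accepted, 0 = rejected

    def record(self, accepted, proposed):
        # Log this round's outcomes, then apply the threshold rules.
        self.history.extend([1] * accepted + [0] * (proposed - accepted))
        rate = sum(self.history) / len(self.history)
        if rate > 0.8:
            self.k = min(self.k + 1, self.hi)
        elif rate < 0.5:
            self.k = max(self.k - 1, self.lo)
        return self.k
```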

Lookahead Decoding (Draft-Free Alternative)

For devices too constrained for a draft model, lookahead decoding uses the main model's own n-gram patterns to speculate without any additional model:

Jacobi iteration: generate multiple future token positions in parallel
└─ n-gram cache: if the model has produced "the United" before,
   speculate that "States" follows without running a forward pass

This gives a smaller speedup (1.3-1.5x) but requires zero additional memory. It's the fallback when the draft model doesn't fit.
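
The n-gram half of this can be sketched as a tiny cache over the model's own output (illustrative class; the Jacobi-iteration half is omitted):

```python
from collections import defaultdict

class NGramSpeculator:
    """Propose draft tokens from previously generated n-grams (n=2 here)."""
    def __init__(self, max_continuation=3):
        self.table = defaultdict(list)   # bigram -> observed continuation
        self.max_continuation = max_continuation

    def update(self, tokens):
        # Remember what followed each bigram generated so far.
        for i in range(len(tokens) - 2):
            key = (tokens[i], tokens[i + 1])
            self.table[key] = tokens[i + 2:i + 2 + self.max_continuation]

    def speculate(self, tokens):
        # If the current bigram was seen before, reuse its continuation as
        # free draft tokens -- no draft model, no extra forward pass.
        if len(tokens) < 2:
            return []
        return self.table.get((tokens[-2], tokens[-1]), [])
```

The proposed tokens are then verified by the main model exactly as draft-model tokens would be, so correctness is unaffected.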


3. Mixture-of-Experts (MoE) On-Device

Problem

Dense models activate every parameter for every token. A 7B dense model does 7B multiplications per token. MoE architectures (DeepSeek-V3, Mixtral, DBRX) have many more total parameters but only activate a subset ("experts") per token — e.g., Mixtral's ~47B total params with only ~13B active per token.

The challenge on-device: all experts must be in memory OR loaded from flash on demand. With 64+ experts, they can't all fit in RAM simultaneously.

How It Works

Token arrives
      │
      ▼
┌───────────────┐
│  Router/Gate  │  Scores all experts, selects top-K (typically K=2)
└───────┬───────┘
        │
        ▼
┌────────────────────────────────────────────────────┐
│ Expert Cache Manager                               │
│                                                    │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │ ← Resident in RAM
│ │Expert 3 │ │Expert 7 │ │Expert 12│ │Expert 21│    │   (max_resident = 4-8)
│ │ (hot)   │ │ (hot)   │ │ (warm)  │ │ (warm)  │    │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘    │
│                                                    │
│ Remaining experts on flash                         │
│                                                    │
│ On cache miss:                                     │
│ 1. Predict next-needed expert from gating input    │
│ 2. Begin async prefetch from flash                 │
│ 3. If not prefetched in time: load synchronously   │
│ 4. Evict least-recently-used resident expert       │
└────────────────────────────────────────────────────┘

Expert Prefetching

This is the key to making MoE feel fast on-device. Naive expert loading on a cache miss adds 5-20ms latency (UFS 4.0 sequential read). Prefetching hides this behind computation:

Gating-based prediction: The gating network's input for layer N is available before layer N's experts are needed. We run the gate for layer N+1 during layer N's forward pass, then start prefetching N+1's experts asynchronously.

Layer N forward pass (compute-bound, ~10ms)
│
├── Meanwhile: run layer N+1 gate on current hidden state
│   └── Predict top-K experts for layer N+1
│       └── Begin async flash read for any not resident
│
▼
Layer N+1 forward pass
└── Experts already in memory (90%+ hit rate with prediction)

Research shows ~90% prediction accuracy for next-layer expert selection. On a 10% miss rate with UFS 4.0, the average latency impact is <1ms per layer.
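
In sketch form, the resident-expert cache with LRU eviction and gate-driven prefetch looks like this (illustrative; `load` stands in for an async flash read):

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of resident experts; `load(eid)` stands in for a flash read."""
    def __init__(self, max_resident, load):
        self.resident = OrderedDict()   # expert_id -> weights
        self.max_resident = max_resident
        self.load = load
        self.hits = self.misses = 0

    def prefetch(self, expert_ids):
        # Called during layer N with the predicted top-K for layer N+1.
        for eid in expert_ids:
            if eid not in self.resident:
                self._admit(eid)

    def get(self, eid):
        if eid in self.resident:
            self.hits += 1
            self.resident.move_to_end(eid)   # refresh LRU position
        else:
            self.misses += 1                 # synchronous load path
            self._admit(eid)
        return self.resident[eid]

    def _admit(self, eid):
        self.resident[eid] = self.load(eid)
        while len(self.resident) > self.max_resident:
            self.resident.popitem(last=False)   # evict least-recently-used
```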

Mixed Precision Expert Loading

Following the HOBBIT3 approach, experts don't all need the same precision:

| Expert State | Precision | When |
|---|---|---|
| Resident (hot) | INT4 | Frequently activated, in RAM |
| Loading (miss) | INT2 | Cache miss — load fast at lower precision |
| Promoted | INT4 | If INT2 expert is reused, upgrade async |

When a cache miss occurs, the expert loads at INT2 (half the I/O time), serves the current token, and is asynchronously upgraded to INT4 for subsequent uses. Quality impact of INT2 on a single expert for a single token is negligible.

What This Unlocks

MoE is what makes "bigger-than-device" models feasible:

| Model | Total Params | Active Params/Token | Size on Flash (Q4) | RAM Needed (4 resident experts) |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | ~24 GB | ~5 GB |
| DeepSeek-V3 (mobile variant) | 16B (distilled) | 2B | ~8 GB | ~2 GB |
| Octomil MoE-3B (custom) | 12B | 3B | ~6 GB | ~2 GB |

MoE Model Support Strategy

| Phase | MoE Source | Client Effort | GPU Cost |
|---|---|---|---|
| Phase 1 | Native MoE from HuggingFace (Mixtral, DBRX, etc.) | Zero — auto-detected | None |
| Phase 2 | Octomil catalog (upcycled versions of popular dense models) | Pick from catalog | Ours (one-time per model) |
| Phase 3 | Client uploads dense model, we upcycle | Upload model, wait ~24h | Ours (billed to client) |

MoE + Federated Learning

When FL trains an MoE model across devices, FedAvg aggregates all parameter groups — experts, gate, attention — the same way: weighted averaging. Devices only upload deltas for experts they activated during local training (not all experts), reducing bandwidth proportionally.
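
That aggregation can be sketched with scalar stand-ins for tensor deltas (illustrative helper; each device reports only the experts it activated, so each parameter is averaged over the devices that touched it):

```python
def fedavg_sparse(updates):
    """updates: list of (num_examples, {param_name: delta}).
    Weighted-average each parameter over the devices that reported it."""
    sums, counts = {}, {}
    for n, deltas in updates:
        for name, delta in deltas.items():
            sums[name] = sums.get(name, 0.0) + n * delta
            counts[name] = counts.get(name, 0) + n
    return {name: sums[name] / counts[name] for name in sums}
```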


4. How the Three Interact

These optimizations are not independent — they share memory and interact at runtime. The memory manager arbitrates:

Device RAM Budget
│
├── OS + App Reserve (fixed, ~2 GB)
│
├── Model Weights
│   ├── Non-expert layers (always resident)
│   ├── Expert cache (MoE only — dynamic, shared pool)
│   └── Draft model (speculative decoding — fixed)
│
├── KV Cache (dynamic, grows with context)
│   ├── Hot tier (FP16)
│   ├── Warm tier (INT8)
│   └── Cold tier (INT4, may spill to flash)
│
└── Safety margin (~256 MB)

Memory Arbitration

The runtime uses a unified memory pool for KV cache and expert cache. When KV cache is nearly full and experts need loading, the arbitrator can:

  1. Compress KV entries from warm to cold tier (free ~30% space)
  2. Spill cold KV to flash (free the cold tier entirely)
  3. Evict least-used experts to make room for KV

This means longer conversations gracefully trade expert cache space for KV space — the model gets slightly slower (more expert cache misses) but doesn't OOM or truncate context.
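
The reclaim ladder can be sketched as ordered actions tried until the request is satisfied (illustrative names and byte counts):

```python
def reclaim(needed_bytes, actions):
    """Run reclamation steps in order until `needed_bytes` are freed.
    `actions` is an ordered list of (name, fn); fn() returns bytes freed."""
    freed, log = 0, []
    for name, fn in actions:
        if freed >= needed_bytes:
            break
        got = fn()
        freed += got
        log.append((name, got))
    return freed, log

# The arbitrator's order from the list above (stub amounts, in MB):
ladder = [
    ("compress_warm_to_cold", lambda: 30),
    ("spill_cold_to_flash",   lambda: 50),
    ("evict_experts",         lambda: 100),
]
```

Cheaper, lower-impact steps run first; expert eviction only happens when compression and spill cannot cover the request.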

Speculative Decoding + KV Cache

Draft model tokens write to a tentative KV cache region. On rejection, the tentative entries are discarded (no wasted cache space). On acceptance, they're promoted to the main cache. Only accepted tokens consume cache budget.

Speculative Decoding + MoE

Speculative decoding amortizes expert loading cost. Instead of loading experts for 1 token, the draft generates N tokens, and the main model's verification pass processes all N at once. If multiple draft tokens activate the same expert, that expert is loaded once and used N times.


5. Device Decision Matrix

| Device | KV Compression | Flash Spill | Speculative (Draft) | Speculative (Lookahead) | MoE Expert Offload |
|---|---|---|---|---|---|
| iPhone 15 Pro (8 GB) | Yes | Yes (NVMe) | Yes (0.5B or 1B draft) | Fallback | Yes |
| iPhone 14 (6 GB) | Yes | Yes (NVMe) | Yes (0.5B draft only) | Fallback | Marginal |
| Pixel 8 (8 GB) | Yes | Yes (UFS 4.0) | Yes (0.5B or 1B draft) | Fallback | Yes |
| Galaxy S24 (8 GB) | Yes | Yes (UFS 4.0) | Yes (0.5B or 1B draft) | Fallback | Yes |
| Mid-range Android (4 GB) | Yes | No (eMMC) | No (insufficient RAM) | Yes | No |

6. SDK Surface

Python (Server-Side Optimization)

from octomil import OctomilClient

client = OctomilClient(api_key="...")

# Deploy a model — optimization happens server-side, config bundled with model
client.deploy_model(
    model_id="my-llm",
    model_path="./qwen-7b",
    target_devices=["iphone_15_pro", "pixel_8"],
    # Optimization is automatic. Override only if needed:
    # inference_config=InferenceConfig(...)
)

Swift (iOS Runtime)

// Load model — all optimizations activate automatically
let model = try await client.loadModel(modelId: "my-llm")

// Generate — speculative decoding, KV compression, MoE all transparent
let stream = client.generateStream(model: model, input: prompt)
for try await chunk in stream {
    print(chunk.text, terminator: "")
}

// Inspect what the runtime chose (observability, not required)
let stats = model.inferenceStats
print(stats.kvCacheUtilization) // 0.73
print(stats.speculativeAcceptanceRate) // 0.68
print(stats.expertCacheHitRate) // 0.91

Kotlin (Android Runtime)

val model = client.loadModel("my-llm")
val stream = client.generateStream(model, prompt)

stream.collect { chunk ->
    print(chunk.text)
}

val stats = model.inferenceStats
Log.d("Octomil", "KV: ${stats.kvCacheUtilization}, Spec: ${stats.speculativeAcceptanceRate}")

7. Implementation Phases

Phase 1: KV Cache Compression (Highest Impact, Lowest Risk)

MorphKV-style fixed-budget cache with tiered quantization. Auto-sizes from the device profile. Validation target on iPhone 15 Pro: max context extends from ~1K to ~2.5K tokens at <0.5% perplexity increase.

Phase 2: Speculative Decoding (Highest Speedup, Medium Complexity)

Draft-verify with octomil/draft-0.5b-q4, adaptive lookahead, lookahead fallback. Target: 2x speedup over baseline with identical output distribution.

Phase 3a: MoE Expert Offloading — Native MoE Models

Expert cache with LRU eviction, gating-based prefetch, mixed-precision loading. Supports any HuggingFace MoE model. Target: Mixtral 8x7B fits in 5GB RAM + 24GB flash.

Phase 3b: MoE Catalog — Sparse Upcycled Dense Models

Pre-built MoE variants of popular dense models via sparse upcycling. Client picks from catalog, zero training required.



Octomil is building the developer platform for federated learning and on-device AI. Visit octomil.com to learn more.

Footnotes

  1. MorphKV — Fixed-size KV cache with iterative token selection. Used as our default KV eviction strategy.

  2. KVSwap — Flash-aware KV offloading for mobile. Basis for our flash spill implementation.

  3. HOBBIT — Mixed-precision MoE expert offloading (INT4/INT2). Basis for our expert cache with mixed precision.