Inference Optimization Design: KV Cache Compression, Speculative Decoding, and MoE On-Device
Design Principle: Every optimization works by default, with zero configuration, and produces correct results. Advanced users can tune. Nobody has to.
# This is what the developer writes. Nothing else.
model = client.load_model("my-llm")
stream = client.generate(model, prompt="Summarize this document:")
# All three optimizations activate automatically based on:
# - device hardware (RAM, NPU, core topology)
# - model architecture (dense vs MoE, attention type)
# - available memory at inference time
If the developer wants control, every optimization exposes a config object — but that second form is never required. The first form picks sane defaults for the device it's running on.
1. KV Cache Compression
Problem
KV cache grows linearly with sequence length. A 7B model at 2048 tokens with FP16 KV needs ~1GB of cache alone. On a phone with 6-8GB RAM (shared with the OS, apps, and the model weights), this is the binding constraint for context length.
How It Works
Token arrives
│
▼
┌──────────────────────────────────────────────────┐
│ KV Cache Manager │
│ │
│ 1. Write new K,V to cache │
│ 2. If cache < budget: done │
│ 3. If cache >= budget: │
│ a. Score every cache entry (attention weight) │
│ b. Evict lowest-scoring entries │
│ c. Compress retained entries to INT4/INT8 │
│ 4. If eviction insufficient: │
│ a. Spill cold entries to flash (UFS/NVMe) │
│ b. Prefetch on next attention if accessed │
└──────────────────────────────────────────────────┘
Default Behavior (Zero Config)
The runtime sizes the KV cache budget automatically:
available_memory = device_ram - os_reserve - model_weights - safety_margin
kv_budget = available_memory * 0.6 # 60% of remaining memory for KV
| Device | RAM | Model (7B Q4) | OS Reserve | KV Budget | Max Context (approx) |
|---|---|---|---|---|---|
| iPhone 15 Pro | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| iPhone 14 | 6 GB | 4 GB | 1.5 GB | ~300 MB | ~600 tokens |
| Pixel 8 | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| Galaxy S24 | 8 GB | 4 GB | 2 GB | ~1.2 GB | ~2,500 tokens |
| Mid-range Android | 4 GB | — | — | — | Use 1-3B model |
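The sizing formula and the table above can be sketched as a small helper. The per-token KV size below assumes a 7B model with 32 layers, 32 KV heads of dimension 128, FP16 — illustrative assumptions, not fixed constants of the runtime:

```python
def kv_budget_bytes(device_ram, os_reserve, model_weights,
                    safety_margin=0.0, kv_fraction=0.6):
    """Portion of remaining RAM given to the KV cache (all args in bytes)."""
    available = device_ram - os_reserve - model_weights - safety_margin
    return max(0, available * kv_fraction)

def max_context_tokens(budget_bytes, layers=32, kv_heads=32,
                       head_dim=128, bytes_per_elem=2):
    """Tokens that fit: each token stores one K and one V vector per layer."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return int(budget_bytes // per_token)

GB = 1024 ** 3
budget = kv_budget_bytes(8 * GB, 2 * GB, 4 * GB)  # iPhone 15 Pro row
print(budget / GB)                  # 1.2
print(max_context_tokens(budget))   # ~2,400 at uncompressed FP16
```

Tiered compression (INT8/INT4 below) is what stretches the FP16 figure toward the table's ~2,500-token estimates.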
When the budget fills, the default eviction policy is MorphKV [1] — a fixed-size cache that iteratively refines which past tokens to keep based on observed attention patterns. No tuning required. It runs continuously.
Compression Tiers
The cache manager applies compression progressively, not all-or-nothing:
| Tier | Condition | Action | Quality Impact |
|---|---|---|---|
| Hot | Accessed in last 128 tokens | FP16, no compression | None |
| Warm | Accessed in last 512 tokens | INT8 quantized | Negligible (<0.1% perplexity) |
| Cold | Older than 512 tokens, low attention score | INT4 quantized | Minor (<0.5% perplexity) |
| Spilled | Evicted but potentially needed | Written to flash, loaded on demand | Latency spike on cache miss (~2-5ms) |
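A minimal sketch of the tiering decision, with the age thresholds taken from the table; the attention-score cutoff is an assumed illustrative value, not a spec'd constant:

```python
def kv_tier(tokens_since_access: int, attention_score: float,
            cold_score_cutoff: float = 0.01) -> str:
    """Map a KV cache entry to a compression tier per the table above."""
    if tokens_since_access <= 128:
        return "hot"    # FP16, untouched
    if tokens_since_access <= 512:
        return "warm"   # re-quantized to INT8
    if attention_score < cold_score_cutoff:
        return "cold"   # INT4, candidate for flash spill
    return "warm"       # old but still attended-to: keep at INT8

print(kv_tier(64, 0.3))     # hot
print(kv_tier(300, 0.3))    # warm
print(kv_tier(900, 0.001))  # cold
```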
Flash Spill (Disk Offloading)
For devices with fast storage (UFS 4.0 on flagship Android, NVMe on iPhone 15 Pro), cold KV entries spill to flash instead of being permanently evicted. This follows the KVSwap [2] approach:
- Entries are grouped into 64-token blocks to match flash I/O granularity
- A reuse buffer retains recently accessed blocks to avoid repeated I/O
- Sequential read pattern optimized for mobile storage controllers
- Disabled on devices with eMMC storage (too slow) — falls back to hard eviction
2. Speculative Decoding
Problem
Autoregressive generation is memory-bandwidth bound, not compute bound. The GPU/NPU sits mostly idle waiting for memory reads. A small "draft" model can speculatively generate N tokens cheaply, and the main model can verify all N in a single forward pass (same cost as generating 1 token).
How It Works
┌─────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Draft Model │ │ Verify (batch) │ │ Output │
│ (0.5B Q4) │────▶│ Main Model │────▶│ Accepted │
│ ~200MB │ │ (7B Q4) │ │ Tokens │
└─────────────┘ └──────────────────┘ └─────────────┘
│ │
│ Generate 5-7 │ Verify all at once
│ draft tokens │ Accept prefix that matches
│ sequentially │ Reject + resample from main
│ (fast, ~5ms each) │ model at divergence point
Key insight: Verification of N tokens costs the same as generating 1 token (single forward pass with the N draft tokens as input). If the draft model's acceptance rate is >50%, you get a net speedup.
Important: Speculative decoding with standard rejection sampling produces identical output distributions to normal decoding. It's not an approximation — it's a lossless speedup.
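For greedy decoding the accept/reject rule reduces to a longest-matching-prefix check, which makes the control flow easy to see. A toy sketch with stand-in models — the real runtime uses rejection sampling over full distributions to preserve the output distribution exactly, and batches the verification into one forward pass rather than the per-position calls emulated here:

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     main_argmax: Callable[[List[int]], int],
                     lookahead: int = 5) -> List[int]:
    """One draft-then-verify round; returns the newly accepted tokens."""
    # 1. Draft model proposes `lookahead` tokens sequentially (cheap).
    drafted = []
    for _ in range(lookahead):
        drafted.append(draft_next(prefix + drafted))
    # 2. Main model verifies every position; accept the matching prefix.
    accepted = []
    for i, tok in enumerate(drafted):
        target = main_argmax(prefix + drafted[:i])
        if tok == target:
            accepted.append(tok)      # still matching: accept for free
        else:
            accepted.append(target)   # diverged: take main model's token
            break
    return accepted

# Toy models: draft predicts x+1; main agrees except right after token 3.
draft = lambda seq: seq[-1] + 1
main = lambda seq: 100 if seq[-1] == 3 else seq[-1] + 1
print(speculative_step([1], draft, main))  # [2, 3, 100]
```

Note that even on a rejection the step still yields one correct token from the main model, so a round is never slower than plain decoding by more than the draft overhead.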
Draft Model Strategy
Octomil ships two universal draft models, fine-tuned for high acceptance rate across common architectures:
| Draft Model | Size (Q4) | Target | Expected Acceptance Rate | Speedup |
|---|---|---|---|---|
| octomil/draft-0.5b-q4 | ~200 MB | Phones with 6+ GB RAM | 60-70% | 1.8-2.2x |
| octomil/draft-1b-q4 | ~450 MB | Phones with 8+ GB RAM | 70-80% | 2.2-2.8x |
Adaptive Lookahead
Fixed lookahead is suboptimal — easy tokens (common phrases, punctuation) have higher acceptance rates than hard tokens (technical terms, reasoning steps). The runtime adjusts:
Running acceptance rate (last 32 tokens):
> 80% → increase lookahead by 1 (max 10)
< 50% → decrease lookahead by 1 (min 2)
else → hold steady
This is tracked per-generation, not persisted. Zero configuration.
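The adjustment rules above amount to a few lines of control logic; a sketch, with the window-of-32 acceptance rate assumed to be tracked elsewhere:

```python
def adjust_lookahead(k: int, acceptance_rate: float,
                     k_min: int = 2, k_max: int = 10) -> int:
    """Adapt speculation depth from the running acceptance rate
    (measured over the last 32 tokens, per the rules above)."""
    if acceptance_rate > 0.80:
        return min(k + 1, k_max)   # easy text: speculate deeper
    if acceptance_rate < 0.50:
        return max(k - 1, k_min)   # hard text: back off
    return k

print(adjust_lookahead(5, 0.90))  # 6
print(adjust_lookahead(2, 0.30))  # 2  (clamped at k_min)
print(adjust_lookahead(5, 0.65))  # 5
```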
Lookahead Decoding (Draft-Free Alternative)
For devices too constrained for a draft model, lookahead decoding uses the main model's own n-gram patterns to speculate without any additional model:
Jacobi iteration: generate multiple future token positions in parallel
└─ n-gram cache: if the model has produced "the United" before,
speculate that "States" follows without running a forward pass
This gives a smaller speedup (1.3-1.5x) but requires zero additional memory. It's the fallback when the draft model doesn't fit.
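A toy sketch of the n-gram side of lookahead decoding (the Jacobi iteration itself is omitted; class and method names are illustrative):

```python
from typing import Dict, List, Optional, Tuple

class NGramCache:
    """Remember (n-1)-token contexts the model has already produced,
    and speculate the token that followed them last time."""
    def __init__(self, n: int = 3):
        self.n = n
        self.table: Dict[Tuple[int, ...], int] = {}

    def observe(self, tokens: List[int]) -> None:
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.table[ctx] = tokens[i + self.n - 1]

    def speculate(self, tokens: List[int]) -> Optional[int]:
        ctx = tuple(tokens[-(self.n - 1):])
        return self.table.get(ctx)  # None -> no free guess, run the model

cache = NGramCache(n=3)
cache.observe([7, 8, 9, 7, 8])     # think: "the United States ... the United"
print(cache.speculate([5, 7, 8]))  # 9: saw "7 8 -> 9" before
```

Speculated tokens are still verified by the main model, so a wrong guess costs nothing beyond the verification slot it occupied.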
3. Mixture-of-Experts (MoE) On-Device
Problem
Dense models activate every parameter for every token. A 7B dense model performs roughly 7B multiply-accumulates per token. MoE architectures (DeepSeek-V3, Mixtral, DBRX) have many more total parameters but only activate a subset ("experts") per token — e.g., Mixtral 8x7B has ~47B total params but only ~13B active per token.
The challenge on-device: all experts must be in memory OR loaded from flash on demand. With 64+ experts, they can't all fit in RAM simultaneously.
How It Works
Token arrives
│
▼
┌───────────────┐
│ Router/Gate │ Scores all experts, selects top-K (typically K=2)
└───────┬───────┘
│
▼
┌────────────────────────────────────────────────────┐
│ Expert Cache Manager │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Expert 3 │ │Expert 7 │ │Expert 12│ │Expert 21│ │ ← Resident in RAM
│ │ (hot) │ │ (hot) │ │ (warm) │ │ (warm) │ │ (max_resident = 4-8)
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Remaining experts on flash │
│ │
│ On cache miss: │
│ 1. Predict next-needed expert from gating input │
│ 2. Begin async prefetch from flash │
│ 3. If not prefetched in time: load synchronously │
│ 4. Evict least-recently-used resident expert │
└────────────────────────────────────────────────────┘
Expert Prefetching
This is the key to making MoE feel fast on-device. Naive expert loading on a cache miss adds 5-20ms latency (UFS 4.0 sequential read). Prefetching hides this behind computation:
Gating-based prediction: The gating network's input for layer N is available before layer N's experts are needed. We run the gate for layer N+1 during layer N's forward pass, then start prefetching N+1's experts asynchronously.
Layer N forward pass (compute-bound, ~10ms)
│
├── Meanwhile: run layer N+1 gate on current hidden state
│ └── Predict top-K experts for layer N+1
│ └── Begin async flash read for any not resident
│
▼
Layer N+1 forward pass
└── Experts already in memory (90%+ hit rate with prediction)
Research shows ~90% prediction accuracy for next-layer expert selection. On a 10% miss rate with UFS 4.0, the average latency impact is <1ms per layer.
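The overlap above can be sketched with a thread pool standing in for async flash I/O; `run_layer`, `predict_experts`, and `load_expert` are illustrative stubs, and the sleeps stand in for the ~10ms compute and ~5ms UFS read budgets mentioned above:

```python
import concurrent.futures as cf
import time

def run_layer(n: int) -> None:
    time.sleep(0.010)       # stand-in for the ~10ms forward pass

def predict_experts(layer: int) -> list:
    return [3, 7]           # stand-in for running layer's gate early

def load_expert(layer: int, e: int) -> None:
    time.sleep(0.005)       # stand-in for a ~5ms flash read

resident: set = set()
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    for layer in range(3):
        # Run layer N+1's gate now, then start its flash reads async.
        wanted = [(layer + 1, e) for e in predict_experts(layer + 1)]
        futures = {key: pool.submit(load_expert, *key)
                   for key in wanted if key not in resident}
        run_layer(layer)                 # compute overlaps with the reads
        for key, f in futures.items():   # usually already finished here
            f.result()
            resident.add(key)
```

Because each read (~5ms) finishes well inside the layer's compute (~10ms), the synchronous `f.result()` wait is normally zero — that is the entire trick.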
Mixed Precision Expert Loading
Following the HOBBIT [3] approach, experts don't all need the same precision:
| Expert State | Precision | When |
|---|---|---|
| Resident (hot) | INT4 | Frequently activated, in RAM |
| Loading (miss) | INT2 | Cache miss — load fast at lower precision |
| Promoted | INT4 | If INT2 expert is reused, upgrade async |
When a cache miss occurs, the expert loads at INT2 (half the I/O time), serves the current token, and is asynchronously upgraded to INT4 for subsequent uses. Quality impact of INT2 on a single expert for a single token is negligible.
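The miss-then-promote lifecycle is a tiny state machine; a sketch (the real upgrade happens asynchronously, which this synchronous toy elides):

```python
class ExpertSlot:
    """Track one expert's precision through the miss -> promote lifecycle."""
    def __init__(self):
        self.precision = None         # not resident

    def on_miss(self) -> str:
        self.precision = "int2"       # half the I/O: serve this token now
        return self.precision

    def on_reuse(self) -> str:
        if self.precision == "int2":
            self.precision = "int4"   # upgraded (async in the real runtime)
        return self.precision

slot = ExpertSlot()
print(slot.on_miss())    # int2
print(slot.on_reuse())   # int4
```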
What This Unlocks
MoE is what makes "bigger-than-device" models feasible:
| Model | Total Params | Active Params/Token | Size on Flash (Q4) | RAM Needed (4 resident experts) |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | ~24 GB | ~5 GB |
| DeepSeek-V3 (mobile variant) | 16B (distilled) | 2B | ~8 GB | ~2 GB |
| Octomil MoE-3B (custom) | 12B | 3B | ~6 GB | ~2 GB |
MoE Model Support Strategy
| Phase | MoE Source | Client Effort | GPU Cost |
|---|---|---|---|
| Phase 1 | Native MoE from HuggingFace (Mixtral, DBRX, etc.) | Zero — auto-detected | None |
| Phase 2 | Octomil catalog (upcycled versions of popular dense models) | Pick from catalog | Ours (one-time per model) |
| Phase 3 | Client uploads dense model, we upcycle | Upload model, wait ~24h | Ours (billed to client) |
MoE + Federated Learning
When FL trains an MoE model across devices, FedAvg aggregates all parameter groups — experts, gate, attention — the same way: weighted averaging. Devices only upload deltas for experts they activated during local training (not all experts), reducing bandwidth proportionally.
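Per-expert aggregation is ordinary sample-weighted FedAvg, just computed over only the devices that uploaded each expert. A sketch with scalar deltas standing in for weight tensors (names illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def aggregate_expert_deltas(
    updates: List[Tuple[int, Dict[int, float]]]
) -> Dict[int, float]:
    """updates: (num_local_samples, {expert_id: delta}) per device.
    Each expert is averaged over only the devices that activated it."""
    sums: Dict[int, float] = defaultdict(float)
    weights: Dict[int, float] = defaultdict(float)
    for n, deltas in updates:
        for eid, delta in deltas.items():
            sums[eid] += n * delta
            weights[eid] += n
    return {eid: sums[eid] / weights[eid] for eid in sums}

# Device A (100 samples) activated experts 0 and 3; device B (300) only 3.
print(aggregate_expert_deltas([
    (100, {0: 1.0, 3: 2.0}),
    (300, {3: 4.0}),
]))  # {0: 1.0, 3: 3.5}
```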
4. How the Three Interact
These optimizations are not independent — they share memory and interact at runtime. The memory manager arbitrates:
Device RAM Budget
│
├── OS + App Reserve (fixed, ~2 GB)
│
├── Model Weights
│ ├── Non-expert layers (always resident)
│ ├── Expert cache (MoE only — dynamic, shared pool)
│ └── Draft model (speculative decoding — fixed)
│
├── KV Cache (dynamic, grows with context)
│ ├── Hot tier (FP16)
│ ├── Warm tier (INT8)
│ └── Cold tier (INT4, may spill to flash)
│
└── Safety margin (~256 MB)
Memory Arbitration
The runtime uses a unified memory pool for KV cache and expert cache. When KV cache is nearly full and experts need loading, the arbitrator can:
- Compress KV entries from warm to cold tier (free ~30% space)
- Spill cold KV to flash (free the cold tier entirely)
- Evict least-used experts to make room for KV
This means longer conversations gracefully trade expert cache space for KV space — the model gets slightly slower (more expert cache misses) but doesn't OOM or truncate context.
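The arbitration order above can be sketched as a fixed fallback chain; pool names, sizes, and the exact 30% figure are illustrative:

```python
from typing import Dict

def free_kv_space(needed_mb: float, state: Dict[str, float]) -> float:
    """Apply the arbitration steps in order until `needed_mb` is freed.
    `state` maps each reclaimable pool to its current size in MB."""
    steps = [
        ("warm_kv", 0.3),   # warm -> cold recompression frees ~30%
        ("cold_kv", 1.0),   # spill the whole cold tier to flash
        ("experts", 1.0),   # finally, evict least-used resident experts
    ]
    freed = 0.0
    for pool, fraction in steps:
        if freed >= needed_mb:
            break
        reclaim = state[pool] * fraction
        state[pool] -= reclaim
        freed += reclaim
    return freed

state = {"warm_kv": 400.0, "cold_kv": 200.0, "experts": 600.0}
print(free_kv_space(250, state))  # ~320: warm recompress + full cold spill
print(state["experts"])           # 600.0 — experts untouched this time
```

Note the ordering encodes the design intent: KV compression and spill are cheaper to undo than expert eviction, so experts are sacrificed last.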
Speculative Decoding + KV Cache
Draft model tokens write to a tentative KV cache region. On rejection, the tentative entries are discarded (no wasted cache space). On acceptance, they're promoted to the main cache. Only accepted tokens consume cache budget.
Speculative Decoding + MoE
Speculative decoding amortizes expert loading cost. Instead of loading experts for 1 token, the draft generates N tokens, and the main model's verification pass processes all N at once. If multiple draft tokens activate the same expert, that expert is loaded once and used N times.
5. Device Decision Matrix
| Device | KV Compression | Flash Spill | Speculative (Draft) | Speculative (Lookahead) | MoE Expert Offload |
|---|---|---|---|---|---|
| iPhone 15 Pro (8 GB) | Yes | Yes (NVMe) | Yes (0.5B or 1B draft) | Fallback | Yes |
| iPhone 14 (6 GB) | Yes | Yes (NVMe) | Yes (0.5B draft only) | Fallback | Marginal |
| Pixel 8 (8 GB) | Yes | Yes (UFS 4.0) | Yes (0.5B or 1B draft) | Fallback | Yes |
| Galaxy S24 (8 GB) | Yes | Yes (UFS 4.0) | Yes (0.5B or 1B draft) | Fallback | Yes |
| Mid-range Android (4 GB) | Yes | No (eMMC) | No (insufficient RAM) | Yes | No |
6. SDK Surface
Python (Server-Side Optimization)
from octomil import OctomilClient
client = OctomilClient(api_key="...")
# Deploy a model — optimization happens server-side, config bundled with model
client.deploy_model(
    model_id="my-llm",
    model_path="./qwen-7b",
    target_devices=["iphone_15_pro", "pixel_8"],
    # Optimization is automatic. Override only if needed:
    # inference_config=InferenceConfig(...)
)
Swift (iOS Runtime)
// Load model — all optimizations activate automatically
let model = try await client.loadModel(modelId: "my-llm")
// Generate — speculative decoding, KV compression, MoE all transparent
let stream = client.generateStream(model: model, input: prompt)
for try await chunk in stream {
    print(chunk.text, terminator: "")
}
// Inspect what the runtime chose (observability, not required)
let stats = model.inferenceStats
print(stats.kvCacheUtilization) // 0.73
print(stats.speculativeAcceptanceRate) // 0.68
print(stats.expertCacheHitRate) // 0.91
Kotlin (Android Runtime)
val model = client.loadModel("my-llm")
val stream = client.generateStream(model, prompt)
stream.collect { chunk ->
    print(chunk.text)
}
val stats = model.inferenceStats
Log.d("Octomil", "KV: ${stats.kvCacheUtilization}, Spec: ${stats.speculativeAcceptanceRate}")
7. Implementation Phases
Phase 1: KV Cache Compression (Highest Impact, Lowest Risk)
MorphKV-style fixed-budget cache with tiered quantization. Auto-sizes from device profile. Validation target on iPhone 15 Pro: max context extends from ~1K tokens to ~2.5K tokens at <0.5% perplexity increase.
Phase 2: Speculative Decoding (Highest Speedup, Medium Complexity)
Draft-verify with octomil/draft-0.5b-q4, adaptive lookahead, lookahead fallback. Target: 2x speedup over baseline with identical output distribution.
Phase 3a: MoE Expert Offloading — Native MoE Models
Expert cache with LRU eviction, gating-based prefetch, mixed-precision loading. Supports any HuggingFace MoE model. Target: Mixtral 8x7B fits in 5GB RAM + 24GB flash.
Phase 3b: MoE Catalog — Sparse Upcycled Dense Models
Pre-built MoE variants of popular dense models via sparse upcycling. Client picks from catalog, zero training required.
Research References
Octomil is building the developer platform for federated learning and on-device AI. Visit octomil.com to learn more.