Inference Optimization Design: KV Cache Compression, Speculative Decoding, and MoE On-Device
Design Principle: Every optimization works by default, with zero configuration, and produces correct results. Advanced users can tune. Nobody has to.
```python
# This is what the developer writes. Nothing else.
model = client.load_model("my-llm")
stream = client.generate(model, prompt="Summarize this document:")

# All three optimizations activate automatically based on:
# - device hardware (RAM, NPU, core topology)
# - model architecture (dense vs. MoE, attention type)
# - available memory at inference time
```
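The selection logic implied by those three inputs can be sketched as a simple heuristic. This is an illustrative sketch only: the names (`DeviceProfile`, `ModelProfile`, `OptimizationPlan`, `select_optimizations`) and the specific thresholds are assumptions, not the real implementation.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    ram_gb: float            # total device RAM
    has_npu: bool            # NPU available for draft-model offload
    performance_cores: int   # big cores available for decode

@dataclass
class ModelProfile:
    is_moe: bool             # mixture-of-experts vs. dense
    kv_bytes_per_token: int  # full-precision KV footprint per token

@dataclass
class OptimizationPlan:
    kv_cache_compression: bool
    speculative_decoding: bool
    expert_offloading: bool

def select_optimizations(dev: DeviceProfile, model: ModelProfile,
                         free_ram_gb: float) -> OptimizationPlan:
    # KV compression: enable when free memory is tight relative to total RAM
    # (illustrative threshold).
    kv = free_ram_gb < 0.5 * dev.ram_gb
    # Speculative decoding: needs spare compute to run a draft model.
    spec = dev.has_npu or dev.performance_cores >= 4
    # Expert offloading only applies to MoE architectures.
    moe = model.is_moe
    return OptimizationPlan(kv, spec, moe)
```

The point is that every input to the decision is observable at load time, so the developer never has to state any of it.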
If the developer wants control, each optimization exposes a config object, but that tuned form is never required. The zero-config form picks sane defaults for the device it's running on.