On-Device LLM Inference: The Definitive 2025-2026 Guide

· 30 min read

In under two years, on-device language models went from research curiosity to mainstream product feature. Smartphones can now run language models with up to 47 billion parameters. Flagship NPUs have crossed the 100 TOPS threshold. Multimodal models process text, images, audio, and video without cloud connectivity. And the first frameworks enabling actual on-device fine-tuning have arrived.

This guide covers the full arc from early 2025 through February 2026: optimization techniques, hardware capabilities, model releases, inference frameworks, performance benchmarks, commercial deployments, and the emerging frontier of on-device training. It is the most comprehensive single resource on mobile LLM inference available today.

Inference Optimization Design: KV Cache Compression, Speculative Decoding, and MoE On-Device

· 11 min read

Design Principle: Every optimization works by default, with zero configuration, and produces correct results. Advanced users can tune. Nobody has to.

# This is what the developer writes. Nothing else.
# (`client` is the SDK's inference entry point, constructed once at startup.)
model = client.load_model("my-llm")
stream = client.generate(model, prompt="Summarize this document:")

# All three optimizations activate automatically based on:
# - device hardware (RAM, NPU, core topology)
# - model architecture (dense vs MoE, attention type)
# - available memory at inference time

If the developer wants control, every optimization exposes a config object — but that second form is never required. The first form picks sane defaults for the device it's running on.
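The two-tier API might look something like the sketch below. The class and field names are illustrative, not the SDK's actual surface; the point is only the shape: a zero-argument default and an optional config object.

```python
# Sketch of the two-tier API: zero-config by default, a config object
# for tuning. All names here are illustrative, not a real SDK.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KVCacheConfig:
    quantization: str = "int8"      # compress cached keys/values
    window: Optional[int] = None    # None = sized from free memory

@dataclass
class SpeculativeConfig:
    draft_model: Optional[str] = None  # None = auto-pick a small draft
    lookahead: int = 4                 # tokens drafted per verify step

@dataclass
class InferenceConfig:
    kv_cache: KVCacheConfig = field(default_factory=KVCacheConfig)
    speculative: SpeculativeConfig = field(default_factory=SpeculativeConfig)

# First form: defaults chosen for the device.
default_cfg = InferenceConfig()

# Second form: override only the knob you care about.
tuned_cfg = InferenceConfig(speculative=SpeculativeConfig(lookahead=8))
```

Every unspecified field keeps its auto-tuned default, so partial overrides compose cleanly.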

The Seven Vectors of Convergence: Why On-Device AI Is Inevitable

· 26 min read

February 2026

Technology paradigm shifts do not arrive as single breakthroughs. They arrive as convergences -- multiple independent trends, each advancing on its own trajectory, reaching a critical density at the same moment in time. The PC revolution required cheap transistors, graphical interfaces, and spreadsheet software simultaneously. The mobile revolution required capacitive touchscreens, 3G networks, and app distribution simultaneously. Cloud computing required virtualization, broadband ubiquity, and pay-per-use billing simultaneously.

We are now witnessing a convergence of equal magnitude. Seven independent vectors -- in hardware, software optimization, regulation, economics, device proliferation, application architecture, and developer infrastructure -- are aligning toward a single, unavoidable conclusion: the future of AI inference is on-device, and the future of AI improvement is federated.

This paper traces each vector with specificity, projects where each leads, and demonstrates why their intersection creates one of the largest platform opportunities in the history of computing.

Beyond Cross-Entropy: Federated AUC Maximization and X-Risk Optimization

· 9 min read

The default objective in machine learning is minimizing cross-entropy loss. It's what PyTorch uses out-of-the-box. It's what most federated learning papers optimize. It's simple, well-understood, and works great for balanced classification problems.

But real-world applications rarely have balanced classes or standard objectives.

  • Medical diagnosis: 1% of patients have disease → 99% accuracy means predicting "no disease" for everyone
  • Fraud detection: 0.1% of transactions are fraudulent → standard loss fails
  • Financial risk: We care about tail risk (the 1% worst-case scenarios), not average loss
  • Imbalanced federated data: Some devices have only positive examples, others only negatives

This post explores specialized optimization objectives for federated learning and how Octomil supports them through recent breakthroughs from Guo, Yang, and collaborators.
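As a taste of what an AUC-style objective looks like, here is a minimal pairwise squared-hinge surrogate, a common choice in the X-risk literature. This is a self-contained sketch, not Octomil's API: AUC counts the fraction of positive/negative pairs ranked correctly, and the surrogate penalizes pairs that violate a margin, so minimizing it pushes AUC toward 1.

```python
# Pairwise squared-hinge surrogate for AUC (illustrative sketch).
def auc_surrogate(scores_pos, scores_neg, margin=1.0):
    """Average squared hinge over all positive/negative score pairs."""
    loss, pairs = 0.0, 0
    for sp in scores_pos:
        for sn in scores_neg:
            # Penalize pairs where the positive does not outrank the
            # negative by at least the margin.
            loss += max(0.0, margin - (sp - sn)) ** 2
            pairs += 1
    return loss / pairs

# A degenerate "predict the majority class" scorer gives every example
# the same score: AUC is 0.5 and the surrogate is maximal.
flat = auc_surrogate([0.0], [0.0, 0.0, 0.0])

# A scorer that separates the classes by the margin incurs no penalty.
sep = auc_surrogate([2.0], [0.5, 0.3, 0.1])
```

Note that the loss couples examples in pairs, which is exactly what makes it nontrivial to optimize in a federated setting: positives and negatives may live on different devices.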

Byzantine-Robust FL: Defending Against Malicious Devices

· 8 min read

Federated learning has an adversary problem.

When training across thousands or millions of devices, you can't trust everyone. Some devices may be:

  • Compromised by malware
  • Malicious (intentionally poisoning the model)
  • Faulty (hardware errors, bugs)
  • Adversarially motivated (competitors, attackers)

A single malicious device uploading carefully crafted gradients can completely destroy model accuracy. Without defenses, federated learning is vulnerable to Byzantine attacks—named after the Byzantine Generals' Problem where some participants may be traitors.

This post explores Byzantine-robust aggregation methods and how Octomil implements defenses against adversarial devices.
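The simplest robust aggregator is the coordinate-wise median, sketched below. Unlike the mean, a single attacker uploading an enormous gradient cannot move the result arbitrarily. (This is an illustrative sketch, not Octomil's production implementation.)

```python
# Coordinate-wise median: a classic Byzantine-robust aggregator.
from statistics import median

def aggregate_median(updates):
    """updates: list of equal-length gradient vectors (lists of floats)."""
    return [median(coord) for coord in zip(*updates)]

honest = [[0.9, -1.1], [1.0, -1.0], [1.1, -0.9]]
poisoned = honest + [[1e9, -1e9]]  # one malicious device

# Plain averaging is destroyed by the single poisoned update...
mean = [sum(c) / len(c) for c in zip(*poisoned)]

# ...while the median barely moves.
robust = aggregate_median(poisoned)
```

The median tolerates up to half the devices being malicious per coordinate; stronger methods (trimmed mean, Krum, and variants) refine this tradeoff between robustness and statistical efficiency.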

Federated LLMs: Prompting, Cascading, and Fine-Tuning at Scale

· 11 min read

Large Language Models have changed everything—including federated learning.

The old FL paradigm: Train a small model (~100M parameters) from scratch across devices.

The new FL paradigm: Adapt a massive pre-trained model (7B-70B parameters) using federated techniques.

But LLMs bring unique challenges to federated learning:

  • Size: 7B parameters ≈ 28 GB at FP32 (won't fit on most devices)
  • Compute: Full fine-tuning requires massive GPU memory
  • Inference cost: Running LLM inference on-device drains battery
  • Privacy: LLM memorization can leak training data

This post explores cutting-edge techniques for federated LLMs, from Virginia Smith's research group and beyond, showing how to make federated learning work in the foundation model era.
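To see why parameter-efficient methods change the calculus, consider the arithmetic behind a LoRA-style adapter: each d_out × d_in weight update is replaced by two low-rank factors. The figures below are illustrative, loosely modeled on a 7B-class transformer's attention projections, not any specific model.

```python
# Back-of-envelope: full fine-tuning vs. a LoRA adapter.
def lora_params(d_out, d_in, rank):
    # Two low-rank factors: (d_out x rank) and (rank x d_in).
    return d_out * rank + rank * d_in

d = 4096            # hidden size (illustrative)
layers = 32
mats_per_layer = 4  # q, k, v, o projections

# Full fine-tune of the attention projections: every weight is updated.
full = layers * mats_per_layer * d * d

# LoRA at rank 8: only the adapter factors are trained and uploaded.
lora = layers * mats_per_layer * lora_params(d, d, rank=8)

ratio = full / lora  # how much smaller each device's upload becomes
```

A ~256× smaller update is the difference between an upload that is impossible on mobile and one that fits in a few megabytes per round.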

Second-Order Federated Learning: When Newton Beats SGD

· 10 min read

Federated learning loves first-order methods. FedAvg, SCAFFOLD, FedProx—they all use gradients (first derivatives). They're simple, memory-efficient, and work reasonably well.

But here's a provocative question: What if we could converge in 10 rounds instead of 100?

Second-order methods—using curvature information (Hessians, second derivatives)—can achieve dramatically faster convergence by taking smarter steps. The classic tradeoff: Fewer iterations, but more computation per iteration.

In federated learning, this tradeoff flips in our favor: Communication is expensive, computation is cheap (especially with modern accelerators). Trading computation for communication is exactly what we want.

This post explores second-order methods for FL and shows when they're worth the extra compute.
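The iteration-count gap is easiest to see on a toy problem. Below, each iteration stands in for one communication round; this is a one-dimensional illustration, and real second-order FL methods approximate the Hessian rather than inverting it exactly.

```python
# Newton vs. gradient descent on a high-curvature quadratic
# f(x) = 0.5 * a * x**2, whose minimum is x = 0.

a = 100.0              # curvature
grad = lambda x: a * x
hess = lambda x: a

# Gradient descent: a stable step size must scale like 1/a,
# so progress per round is slow.
x_gd, lr = 1.0, 0.005
gd_rounds = 0
while abs(x_gd) > 1e-6:
    x_gd -= lr * grad(x_gd)
    gd_rounds += 1

# Newton: rescaling the gradient by the curvature lands exactly on the
# optimum of a quadratic in a single round.
x_newton = 1.0 - grad(1.0) / hess(1.0)
```

Gradient descent needs 20 rounds here; Newton needs one. When each round costs a full device-to-server communication cycle, that gap dominates wall-clock time.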

Variance Reduction: The Secret to Fast FL Convergence

· 10 min read

Why does federated learning take so many communication rounds to converge?

A typical FL training job might require:

  • Standard SGD: 1,000+ rounds to converge
  • With variance reduction: 100-200 rounds to converge
  • Result: 5-10× speedup in wall-clock time

Variance reduction is the algorithmic technique that makes this possible. It's the difference between federated learning being a research curiosity and a production-viable technology.

This post dives into variance reduction methods—MARINA, PAGE, SAGA, and their variants—and explains why they're fundamental to efficient federated learning.
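The shared idea behind these methods is a control variate: correct each stochastic gradient with stored history so its variance shrinks as training converges. Here is a minimal SAGA sketch on a toy objective (MARINA and PAGE differ in the details but use the same idea); the data values and step size are illustrative.

```python
import random

# SAGA on a toy objective: minimize the average of
# f_i(x) = 0.5 * (x - b_i)**2, whose optimum is mean(b) = 3.0.
random.seed(0)
b = [1.0, 2.0, 3.0, 6.0]       # one data point per "device"
n = len(b)

x = 0.0
table = [x - bi for bi in b]   # stored gradient for each component
table_avg = sum(table) / n
lr = 0.3

for _ in range(200):
    j = random.randrange(n)
    g_new = x - b[j]           # fresh stochastic gradient
    # Variance-reduced estimate: unbiased, and its variance vanishes
    # as the table converges to the true per-component gradients.
    g = g_new - table[j] + table_avg
    x -= lr * g
    table_avg += (g_new - table[j]) / n
    table[j] = g_new
```

Plain SGD with a fixed step size would oscillate around the optimum forever; the corrected estimator converges linearly despite sampling only one component per step.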

Communication-Efficient Federated Learning: From Theory to Production

· 6 min read

Communication is the most expensive operation in federated learning. While devices have increasingly powerful processors, network bandwidth remains constrained—especially on mobile devices with unreliable connections. This fundamental bottleneck has driven a decade of research into communication-efficient FL techniques.

In this post, we explore state-of-the-art communication reduction methods and show how Octomil implements these techniques for production use.
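One of the most widely used techniques is top-k gradient sparsification with error feedback: each round, transmit only the k largest-magnitude coordinates and carry the remainder forward, so nothing is permanently discarded. A minimal sketch (illustrative, not Octomil's wire format):

```python
# Top-k sparsification with error feedback.
def sparsify_topk(grad, residual, k):
    """Returns (sparse update to transmit, new residual to carry forward)."""
    # Add back the error left over from previous rounds.
    full = [g + r for g, r in zip(grad, residual)]
    # Indices of the k largest-magnitude entries.
    top = sorted(range(len(full)), key=lambda i: abs(full[i]), reverse=True)[:k]
    sent = {i: full[i] for i in top}
    # Unsent coordinates accumulate in the residual for later rounds.
    new_residual = [0.0 if i in sent else full[i] for i in range(len(full))]
    return sent, new_residual

grad = [0.1, -2.0, 0.05, 3.0]
sent, residual = sparsify_topk(grad, [0.0] * 4, k=2)
```

Only the index-value pairs in `sent` cross the network, and the residual guarantees small-but-persistent gradient components are eventually transmitted rather than silently dropped.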

Handling Device Heterogeneity: Asynchronous FL for the Real World

· 9 min read

The textbook version of federated learning assumes a perfect world:

  • All devices have similar compute power
  • Network connections are equally fast
  • Devices complete training at roughly the same time
  • No one drops out mid-round

Reality: None of these assumptions hold.

In production FL, you're coordinating across:

  • iPhone 15 Pro (6-core CPU, 16-core Neural Engine) vs. budget Android (4-core, no NPU)
  • 5G or fiber (1 Gbps) vs. rural 3G (0.5 Mbps)
  • Always-plugged smart display vs. battery-conscious smartphone
  • Reliable edge server vs. intermittent mobile device

This post explores how Octomil handles the chaos of real-world device heterogeneity through asynchronous federated learning.
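The core trick behind asynchronous schemes like FedAsync is staleness-aware weighting: a fast device's update gets full weight, while a straggler's update is damped by how many global versions it missed. A sketch with illustrative constants (not Octomil's actual scheduler):

```python
# Staleness-aware mixing for asynchronous federated updates.
def staleness_weight(current_version, update_version, alpha=0.6):
    """Polynomial decay: weight = alpha * (staleness + 1) ** -0.5."""
    staleness = current_version - update_version
    return alpha * (staleness + 1) ** -0.5

def apply_update(global_model, update, weight):
    # Mix the incoming update into the global model, damped by weight.
    return [(1 - weight) * g + weight * u
            for g, u in zip(global_model, update)]

model = [1.0, 1.0]
fresh = staleness_weight(10, 10)  # trained against the current model
stale = staleness_weight(10, 2)   # started 8 global versions ago

model = apply_update(model, [0.0, 2.0], fresh)
```

Because every update is applied the moment it arrives, no round ever blocks on the slowest device; stale contributions still help, just proportionally less.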