On-Device LLM Inference: The Definitive 2025-2026 Guide
In under two years, on-device language models went from research curiosity to mainstream product feature. Smartphones can now run language models with up to 47 billion parameters. Flagship NPUs have crossed the 100 TOPS threshold. Multimodal models process text, images, audio, and video without cloud connectivity. And the first frameworks enabling actual on-device fine-tuning have arrived.
This guide covers the full arc from early 2025 through February 2026: optimization techniques, hardware capabilities, model releases, inference frameworks, performance benchmarks, commercial deployments, and the emerging frontier of on-device training. It is the most comprehensive single resource on mobile LLM inference available today.
The optimization toolkit
The techniques for compressing and accelerating LLMs on mobile have grown from a handful of quantization methods to a full stack of complementary approaches spanning model design, training, and inference.
Quantization: from 4-bit standard to 2-bit frontier
4-bit quantization is the industry standard for on-device deployment. AWQ (Activation-aware Weight Quantization)[1] achieves approximately 95% quality retention with perplexity of 6.84 on standard benchmarks, while GPTQ[2] provides roughly 90% quality at similar compression. The GGUF format -- used by llama.cpp[3] -- offers flexible hybrid schemes like Q4_K_M (mixing 4-bit and 6-bit) and achieves perplexity of 6.74 with strong mobile CPU performance, making it the most widely used format for on-device inference.
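To make the idea concrete, here is a minimal sketch of symmetric blockwise 4-bit quantization -- a deliberately simplified scheme, not the actual Q4_K_M or AWQ algorithm, which add per-group minima, mixed precision, and activation-aware scale selection:

```python
import numpy as np

def quantize_q4(w, block_size=32):
    """Symmetric blockwise 4-bit quantization: one float scale per block
    of 32 weights, integer codes in [-8, 7]."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_q4(codes, scale):
    return (codes * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
codes, scale = quantize_q4(w)
err = np.abs(dequantize_q4(codes, scale) - w).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

Each weight now costs 4 bits plus its share of one scale per 32-weight block (~4.5 bits effective), versus 16 or 32 bits at full precision -- which is where the 3.5-4x memory savings in the tables below come from.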
Meta introduced SpinQuant[4] in late 2024, using learned rotation matrices to reshape activation distributions and eliminate weight outliers before quantization. Accepted at ICLR 2025, it achieves W4A4KV4 (4-bit weights, activations, and KV cache simultaneously) with only a 2.9-point accuracy gap on Llama-2 7B zero-shot reasoning, and W4A8KV4 with under 3% accuracy loss. SpinQuant is integrated into ExecuTorch and was used for the official quantized Llama 3.2 1B/3B releases.
At the extreme frontier, AQLM (Additive Quantization)[5] pushes viable quantization to just 2 bits per parameter -- compressing a 7B model to approximately 1.75 GB, small enough for any modern smartphone. QuIP#[6] achieves viable 2-bit quantization on select models using Hadamard rotations and lattice-based weight mapping (perplexity of 6.06 on Llama-2 70B at 2-bit).
ParetoQ[7] (Meta, NeurIPS 2025) is the most significant quantization advance of the period, establishing that for a fixed model-size budget, a larger model at 2 bits outperforms a half-size model at 4 bits -- a finding with profound implications for mobile deployment. ParetoQ's 2-bit GPU kernels achieve a 4.14x speedup over FP16, with integration into torchao and ExecuTorch. The work also identified a sharp learning transition between 2 and 3 bits: below 3 bits, models learn fundamentally different representations rather than compressed versions of the full-precision ones. Its ternary 600M model outperforms the previous state-of-the-art 3B ternary model, suggesting that smaller, natively-trained low-bit models may beat larger post-training-quantized ones.
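The memory arithmetic behind these claims is simple enough to check directly (decimal GB, weights only -- ignoring embeddings, quantization scales, and KV cache):

```python
def weight_gb(params, bits):
    # Weight storage in decimal gigabytes: params * bits / 8 bytes.
    return params * bits / 8 / 1e9

# A 7B model at 2 bits fits in ~1.75 GB, as AQLM targets.
print(weight_gb(7e9, 2))   # 1.75

# ParetoQ's fixed-budget comparison: a 3B model at 2 bits occupies
# exactly the same bytes as a 1.5B model at 4 bits.
assert weight_gb(3e9, 2) == weight_gb(1.5e9, 4)
```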
KV-cache optimization and attention innovations
Grouped-Query Attention (GQA)[8] is now the default for mobile-targeted models, reducing KV cache size by a factor proportional to the group size. Llama 3.2 3B uses 8 KV heads against 24 query heads, a 3x KV cache reduction. Sliding window attention (used by Mistral 7B with a 4096-token window) caps memory usage regardless of context length. MLC LLM[9] implements paged KV cache on mobile, borrowed from server-side vLLM, enabling robust long-context inference at 64K-128K tokens without the fragmentation and over-allocation of contiguous caches.
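A quick calculation shows why GQA matters so much on phones. The numbers below assume a Llama 3.2 3B-like configuration (28 layers, head dimension 128, FP16 cache) -- illustrative values, not measurements:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

cfg = dict(layers=28, head_dim=128, seq_len=8192)  # assumed 3B-class config
mha = kv_cache_bytes(kv_heads=24, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **cfg)   # grouped-query attention

print(f"MHA: {mha / 1e9:.2f} GB, GQA: {gqa / 1e9:.2f} GB at 8K context")
```

At 8K context the grouped cache is roughly 0.94 GB versus 2.82 GB for full multi-head attention -- the difference between fitting in a phone's RAM budget and not.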
KV-cache quantization progressed substantially through 2025. KVQuant[10] (NeurIPS 2024) enables Llama-7B with 1M context on a single A100 through per-channel key quantization. KVTuner automatically searches for optimal per-layer KV precision (K8V4, K4V2), achieving near-lossless quality at an average 3.25-bit KV cache. EvolKV uses evolutionary search for per-layer cache budgets, matching full KV cache performance with just 1.5% of the original budget on some tasks. ChunkKV treats semantic chunks rather than individual tokens as compression units, yielding 26% throughput improvement. Apple's A19 Pro added hardware support for the mxfp4 (microscaling FP4) format, signaling hardware catching up to software quantization advances.
Pruning, distillation, and architectural innovations
Pruning matured into a practical complement to quantization. Wanda[11] (CMU/Meta, ICLR 2024) prunes by the product of weight magnitude and activation norm, requiring only a single forward pass -- 300x faster than SparseGPT -- while matching its accuracy at 50% unstructured sparsity on Llama-7B. Combining INT4 quantization with 75% pruning yields significantly better quality than INT2 quantization alone at equivalent model size.
Knowledge distillation drove the creation of the most capable small models. Meta's Llama 3.2 1B/3B used structured pruning from Llama 3.1 8B followed by distillation incorporating logits from both the 8B and 70B models. Google's Gemma 2 2B replaced standard next-token prediction entirely with distillation from a larger teacher. Microsoft's Phi series trained primarily on synthetic data generated by GPT-4o -- and Phi-4 (14B) actually surpassed its teacher on STEM benchmarks.
MobileLLM[12] (Meta, ICML 2024) demonstrated that for sub-billion models, depth matters more than width -- a finding that challenges standard scaling laws. Its deep-and-thin architecture with embedding sharing, GQA, and block-wise weight sharing achieves 2.7-4.3% accuracy gains at 125M-350M scale. The block-wise weight sharing technique, where adjacent transformer blocks share identical weights, adds 0.7-0.8% accuracy at zero additional model size.
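Block-wise sharing is easy to sketch: adjacent layers simply point at the same weight tensors, so effective depth doubles while parameter count does not. A toy NumPy version of the idea, not MobileLLM's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 8

# Four unique blocks serve eight layers: layers 2i and 2i+1 share weights.
unique_blocks = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers // 2)]
layers = [unique_blocks[i // 2] for i in range(n_layers)]

x = rng.normal(size=d)
for w in layers:
    x = np.maximum(w @ x, 0.0)  # toy block: linear + ReLU

total_params = n_layers * d * d             # cost of 8 independent layers
shared_params = len(unique_blocks) * d * d  # cost with pairwise sharing
print(f"{n_layers} layers of compute, "
      f"{shared_params / total_params:.0%} of the parameters")
```

Because shared blocks stay resident, the extra depth also costs no additional weight loads from memory -- useful on bandwidth-starved mobile SoCs.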
Speculative decoding and operator fusion
Speculative decoding is being adapted for mobile through systems like EdgeLLM[13] (IEEE TMC 2024), achieving up to 9.3x speedup by running a small draft model on-device while a larger target verifies. Universal Speculative Decoding[14] (ICML 2025) enables any small draft model to accelerate any target LLM regardless of vocabulary differences, achieving up to 2.8x faster inference. MediaTek's Dimensity 9400+ introduced Speculative Decoding+ for 20% faster agentic AI. Experimental diffusion-based LLMs (LLaDA, TiDAR) that predict multiple tokens per step showed 4-6x speedups in research settings.
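The core draft-then-verify loop of greedy speculative decoding fits in a few lines. The sketch below uses deterministic stand-in "models"; the key property -- output identical to running the target alone, with fewer target invocations whenever the draft agrees -- is exactly what the mobile systems above exploit:

```python
def greedy_decode(model, prompt, n_new):
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq[len(prompt):]

def speculative_decode(target, draft, prompt, n_new, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < n_new:
        # 1. Draft proposes k tokens autoregressively (cheap).
        ctx, proposal = list(seq), []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies: accept the longest prefix it agrees with.
        ctx = list(seq)
        for t in proposal:
            if target(ctx) != t:
                break
            ctx.append(t)
        # 3. Target always contributes one token, guaranteeing progress.
        ctx.append(target(ctx))
        seq = ctx
    return seq[len(prompt):len(prompt) + n_new]

# Toy deterministic models over integer tokens; the draft disagrees
# occasionally, yet the output still matches pure target decoding.
target = lambda ctx: (3 * ctx[-1] + len(ctx)) % 17
draft = lambda ctx: target(ctx) if len(ctx) % 5 else (target(ctx) + 1) % 17

out_spec = speculative_decode(target, draft, [1, 2, 3], n_new=30)
out_plain = greedy_decode(target, [1, 2, 3], n_new=30)
assert out_spec == out_plain
```

Real systems verify all k drafted tokens in a single batched target forward pass (and use rejection sampling for non-greedy decoding), which is where the speedup comes from; the acceptance logic is the same.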
Operator fusion is a key mobile optimization. MNN-LLM[15] (Alibaba) uses automatic region fusion to achieve 8.6x prefill speedup over llama.cpp on CPU and 25.3x on GPU. ExecuTorch[16] integrates KleidiAI's optimized low-bit matrix multiplication kernels for Arm CPUs. The llm.npu system[17] (ASPLOS 2025) addresses running LLMs on commercial mobile NPUs by scheduling float operators (LayerNorm, Attention) to CPU/GPU while routing integer-heavy MatMul to the NPU.
Hardware: the 100 TOPS NPU era
The mobile SoC AI race through early 2025
Through early 2025, the leading mobile chips delivered meaningful but incremental gains in AI capability:
| SoC | Process | NPU TOPS | Key LLM capability |
|---|---|---|---|
| Snapdragon 8 Gen 3 | TSMC 4nm | Est. 40-50 | 20 tok/s (Llama-2 7B on NPU), 3B-13B models |
| Snapdragon 8 Elite | TSMC 3nm | ~45% faster than Gen 3 | 17.9 tok/s (Llama-2 7B), multimodal demo |
| Apple A18 Pro | TSMC 3nm | 35 TOPS | Apple Intelligence (~3B model), 15% faster AI vs A17 |
| Dimensity 9400 | TSMC 3nm | 50 TOPS | AI Benchmark leader, on-device LoRA training |
| Exynos 2500 | Samsung 3nm GAA | 59 TOPS | 39% NPU improvement over Exynos 2400 |
| Tensor G4 | Samsung 4nm | Low (same as G3) | 45 tok/s (Gemini Nano ~3.5B), software-optimized |
The 2025-2026 leap: five new flagship chipsets
Every major mobile silicon vendor shipped new AI-focused chips between August and December 2025, converging on TSMC's 3nm N3P process (except Samsung, which broke new ground with 2nm).
Qualcomm's Snapdragon 8 Elite Gen 5 leads the pack with its Hexagon NPU delivering an estimated 100 TOPS -- more than double the original 8 Elite. Qualcomm claims 220 tokens/second decode speed and 32K context window support for on-device LLMs. The chip pairs 3rd-gen Oryon cores (peak 4.6 GHz) with 38% higher memory bandwidth. Google's LiteRT QNN accelerator on this NPU achieved 11,000+ tokens/second prefill for FastVLM, with a time-to-first-token of just 0.12 seconds on 1024x1024 images.
MediaTek's Dimensity 9500 matches Qualcomm's AI claims with its NPU 990 rated at 100 TOPS. Its standout innovation is compute-in-memory (CIM) architecture for always-on AI and native BitNet 1.58-bit model processing, claiming 33% lower power through ternary weight support. MediaTek reports 2x token generation speed for 3B-parameter LLMs versus the Dimensity 9400, plus industry-first 128K token context processing on-device.
Apple's A19 Pro embedded Neural Accelerators directly into each GPU core -- dedicated tensor units enabling approximately 4x peak GPU compute for ML workloads compared to the A18 Pro. The iPhone 17 Pro provides 12GB LPDDR5X-9600 at 76.8 GB/s bandwidth, and Apple added a vapor chamber cooling system for 40% better sustained AI performance. Third-party benchmarks show the iPhone 17 Pro achieving 136 tokens/second for a quantized sub-1B model.
Google's Tensor G5 represents a fundamental reset -- the first Tensor chip manufactured by TSMC rather than Samsung Foundry. Its 4th-generation TPU is 60% more powerful than the G4 and runs Gemini Nano 2.6x faster and 2x more efficiently. The chip supports a 32,000-token context window (up from 12K) and keeps ~3GB of RAM dedicated to AI models.
Samsung's Exynos 2600 is the world's first 2nm GAA (Gate-All-Around) smartphone chip, with 32,768 MAC units and 113% improvement in generative AI performance over the Exynos 2500.
| Chipset | Est. TOPS | Process | Key AI innovation | Memory |
|---|---|---|---|---|
| Snapdragon 8 Elite Gen 5 | ~100 | TSMC 3nm | 220 tok/s decode, 32K context | LPDDR5X, 38% higher BW |
| Dimensity 9500 | ~100 | TSMC 3nm | CIM architecture, BitNet support | 4-ch UFS 4.1 |
| Apple A19 Pro | ~35 (NE only) | TSMC 3nm | GPU Neural Accelerators (4x ML) | 12GB LPDDR5X-9600 |
| Tensor G5 | Undisclosed (60% up) | TSMC 3nm | Matryoshka transformer, 32K context | 16GB LPDDR5X (Pro) |
| Exynos 2600 | Undisclosed (113% up) | Samsung 2nm | 32K MAC, first 2nm mobile | LPDDR5X |
Memory bandwidth remains the wall
Memory bandwidth -- not compute -- is the fundamental bottleneck for on-device token generation. LPDDR5x delivers 50-90 GB/s across flagships versus 2-3 TB/s for datacenter GPUs -- a 30-50x gap. Since LLM decode is memory-bandwidth-bound, this limits how fast large models can generate tokens regardless of NPU TOPS. A 4-bit 3B model needs ~2 GB read per token; at 70 GB/s bandwidth, the theoretical ceiling is ~35 tokens/second.
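This back-of-envelope calculation is worth writing out, since it explains most of the benchmark numbers in this guide:

```python
def decode_ceiling_tok_s(bytes_read_per_token, bandwidth_gb_s):
    # Decode is memory-bound: every generated token re-reads the weights,
    # so tokens/second cannot exceed bandwidth / bytes-per-token.
    return bandwidth_gb_s * 1e9 / bytes_read_per_token

# ~2 GB read per token (4-bit 3B model plus overhead) at phone bandwidth:
phone = decode_ceiling_tok_s(2e9, 70)          # ~35 tok/s ceiling
datacenter = decode_ceiling_tok_s(2e9, 2500)   # same model on an HBM GPU
print(f"phone ceiling: {phone:.0f} tok/s, datacenter: {datacenter:.0f} tok/s")
```

No amount of NPU TOPS moves the phone number; only higher bandwidth, smaller weights (lower-bit quantization), or reading fewer weights per token (sparsity, MoE) does.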
The industry is attacking this from multiple angles: LPDDR5X-9600 (76.8 GB/s on iPhone 17 Pro), JEDEC's LPDDR6 spec (published July 2025, promising 14.4 Gbps per pin, expected in devices by mid-2026), MediaTek's compute-in-memory architecture, and Apple's on-chip High Performance Memory caches.
What phones can actually run
Model size versus device capability
For standard dense models, the practical limits are:
| Model size | Quantization | Approx. memory | Viable on | Typical decode speed |
|---|---|---|---|---|
| 0.5B-1B | Q4 | 0.4-0.7 GB | Any modern phone | 40-87 tok/s |
| 1.5B-3B | Q4 | 1-2 GB | Mid-range+ (6 GB RAM) | 15-40 tok/s |
| 7B | Q4 | 3.5-5 GB | Flagships (12+ GB RAM) | 8-12 tok/s |
| 13B | Q4 | 8-10 GB | Ultra-flagships (16+ GB) | 3-6 tok/s |
| 47B (MoE, sparse) | Custom | Streaming from flash | 24 GB devices | ~12 tok/s |
RAM remains the binding constraint. A flagship phone with 12 GB total has roughly 8 GB available after OS overhead. iOS imposes strict per-app memory limits that further constrain model sizes. Google's Pixel 9 Pro dedicates a 3 GB partition of its 16 GB exclusively for AI workloads.
PowerInfer-2 achieved the record: a 47B-parameter MoE model (TurboSparse-Mixtral) running on a OnePlus 12 at 11.68 tokens/second.[18] Published at MobiCom '24, this system exploits the fact that MoE architectures activate only a fraction of parameters per token (~3B of 47B) and uses neuron-cluster-level I/O pipelining to stream weights from flash storage -- a 22x speedup over llama.cpp on the same device.
Techniques for exceeding available RAM include PowerInfer-2's flash-to-DRAM weight streaming, Apple Research's "LLM in a Flash" approach, and MNN-LLM's DRAM-Flash hybrid storage. Activation sparsity -- using ReLU-family activations to achieve ~90% sparsity -- means most neuron weights never need loading for a given token.
Mobile-first small language models
The first wave: late 2024 through early 2025
Llama 3.2 1B/3B[19] (September 2024) arrived with 128K context, tool-calling, and day-one optimization for Snapdragon, Dimensity, and Arm hardware. Quantized variants using SpinQuant and QLoRA achieved 56% size reduction and 4.2x prefill improvement. SmolLM2[20] (November 2024) at 135M/360M/1.7B parameters outperformed Llama 3.2 1B on science reasoning after training on 11 trillion tokens. Qwen 2.5[21] (September 2024) provided edge models at 0.5B/1.5B/3B trained on 18 trillion tokens with 128K context. Phi-4[22] (December 2024, 14B) outperformed Llama 3.3 70B on math.
The second wave: mid-2025
The most consequential release was Google's Gemma 3n[23] (preview May 2025), with genuine architectural innovation. Its MatFormer (Matryoshka Transformer) design nests an effective-2B model inside an effective-4B model, enabling elastic inference that adapts compute to the task. The E4B variant became the first sub-10B model to exceed 1300 Elo on LMArena. Its Per-Layer Embeddings (PLE) technique offloads embedding parameters to fast storage, cutting active RAM usage. Gemma 3n is natively multimodal -- processing text, images, audio, and video -- using a MobileNet-V5 vision encoder running at 60 FPS on Pixel and a USM audio encoder.
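The nesting idea can be sketched in a few lines: the small model's FFN is literally the first slice of the large model's weight matrices, so one checkpoint serves multiple compute budgets. A toy illustration of the principle, not Gemma 3n's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn = 64, 256
w_up = rng.normal(size=(d_ffn, d_model)) / np.sqrt(d_model)
w_down = rng.normal(size=(d_model, d_ffn)) / np.sqrt(d_ffn)

def ffn(x, width):
    # Elastic inference: evaluate only the first `width` hidden units.
    # Full width is the "E4B-like" path; half width is the nested
    # "E2B-like" path -- same weights, roughly half the compute.
    h = np.maximum(w_up[:width] @ x, 0.0)
    return w_down[:, :width] @ h

x = rng.normal(size=d_model)
y_small = ffn(x, d_ffn // 2)
y_full = ffn(x, d_ffn)
print(y_small.shape == y_full.shape)  # same interface, different budgets
```

Training optimizes both slices jointly so the nested sub-model is itself a good model, which is what lets the runtime pick a width per request.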
Google also released Gemma 3 270M (August 2025), consuming just 0.75% battery for 25 conversations on a Pixel 9 Pro with INT4 quantization. The full Gemma 3 1B achieves 2,585 tokens/second prefill on mobile GPU.
Alibaba's Qwen 3 family[24] (April 2025) delivered dense models from 0.6B to 32B plus MoE variants, with Qwen3-30B-A3B (30B total, 3B active, 128 experts) outperforming QwQ-32B with one-tenth the active parameters. In February 2026, Alibaba released Qwen 3.5 -- a 397B MoE model natively multimodal across text, images, video, and audio in 201 languages.
SmolLM3[25] (July 2025) brought a 3B model with dual-mode reasoning (thinking/non-thinking), trained on 11.2 trillion tokens, supporting 128K context. It scores 36.7% on AIME 2025 in thinking mode -- competitive with models 2-3x its size.
Microsoft's Phi-4 family expanded with Phi-4-mini (3.8B, 128K context, MIT license), Phi-4-multimodal (5.6B, text+image+speech, #1 on HuggingFace OpenASR leaderboard), and Phi-4-mini-reasoning (3.8B, chain-of-thought, 25-40 tok/s on Snapdragon NPU).
Mistral released Ministral 3[26] (December 2025) -- nine dense models across 3B, 8B, and 14B in base, instruct, and reasoning variants, all Apache 2.0, targeting phones, drones, and robots. The smallest variants run on 4GB VRAM.
Apple published its Foundation Models technical report[27] (July 2025), revealing a ~3B on-device model with KV-cache sharing (37.5% of layers share KV projections, reducing cache memory and TTFT by ~37.5%) and 2-bit quantization-aware training. The model uses LoRA adapters for task specialization.
BitNet and 1-bit inference
Microsoft's BitNet b1.58[28] proved that ternary quantization (every weight constrained to -1, 0, or +1) can match FP16 performance at 3B+ scale while being 2.71x faster with 3.55x less memory. The accompanying bitnet.cpp framework (October 2024) achieved 2.37-6.17x speedups on x86 CPUs and 55-82% energy reduction. A 100B BitNet model could theoretically run on a single CPU at 5-7 tokens/second -- human reading speed.
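The absmean quantization function from the BitNet b1.58 paper is compact enough to reproduce (training-time details like the straight-through estimator are omitted here):

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    # BitNet b1.58 absmean scheme: scale by the mean absolute weight,
    # then round-and-clip every entry to {-1, 0, +1}.
    gamma = np.abs(w).mean()
    q = np.clip(np.round(w / (gamma + eps)), -1, 1).astype(np.int8)
    return q, gamma

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, gamma = ternary_quantize(w)

levels = sorted(set(q.ravel().tolist()))
print(levels)               # [-1, 0, 1]
print(np.log2(len(levels))) # log2(3) ~= 1.58 bits per weight -- the name
```

With only three weight values, matrix multiplication reduces to additions and subtractions scaled by gamma, which is why ternary models cut both memory and energy so sharply.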
BitNet b1.58 2B4T[29] (April 2025) became the first open-source native 1-bit LLM at 2B parameters, trained from scratch on 4 trillion tokens. It occupies just 0.4 GB for non-embedding weights (versus 1.4-4.8 GB for comparable models), achieves 29ms decoding latency on CPU, and consumes 6x less energy per inference than Gemma 3 1B. It outperforms Llama 3.2 1B, Gemma 3 1B, and Qwen 2.5 1.5B on standard benchmarks.
The bitnet.cpp framework continued improving: GPU inference kernels arrived in May 2025, and a CPU optimization update in January 2026 delivered 1.15-2.1x additional speedup. MediaTek's Dimensity 9500 became the first mobile chipset with native BitNet 1.58-bit processing support, claiming 33% lower power.
However, no native 1-bit model larger than 2B has been released as of February 2026. Microsoft has indicated plans to explore 7B and 13B scales, but these remain future work. PT-BitNet, a post-training quantization method, can convert existing models to 1.58-bit up to 70B parameters, achieving 61% average downstream accuracy -- but with meaningful quality tradeoffs. The practical verdict: BitNet is promising but has stalled at 2B scale, while conventional 2-bit and 4-bit quantization techniques (ParetoQ, SpinQuant) have proven more immediately impactful.
Multimodal AI arrives on phones
The most significant shift in on-device AI is the move from text-only to multimodal models processing images, audio, and text together without cloud connectivity.
Commercial multimodal deployments
Gemini Nano with Multimodality debuted on the Pixel 9 series in August 2024 and processes text, images, and audio on-device. The multimodal version is nearly 2x larger than the original text-only Nano-2 (3.25B). It powers Pixel Screenshots (image understanding), TalkBack (accessibility), Call Notes (audio summarization), and Scam Detection (real-time audio pattern recognition). Gemini Nano v3 launched on Pixel 10 with full multimodal capabilities, and Google's ML Kit GenAI APIs opened Nano to third-party developers.
Apple Intelligence gained visual capabilities in 2025, with the on-device ~3B model incorporating image data from 10B+ image-text pairs during pre-training. Visual Intelligence lets users search and ask questions about camera/screen content. Image Playground generates images on-device, and Genmoji creates custom emoji. iOS 26 expanded Visual Intelligence from Camera Control to work on any screen content.
Research and open-source multimodal models
Gemma 3n set the standard with four-modality input (text, image, audio, video) in an effective-2B/4B footprint. MiniCPM-o 4.5[30] (open-sourced February 2026, published in Nature Communications) pushed further with full-duplex multimodal live streaming -- a 9B model that can see, listen, and speak simultaneously, with voice cloning and bilingual real-time speech. Its predecessor MiniCPM-V 2.5 (8B) surpassed GPT-4V on 11 benchmarks at 6-8 tok/s on phones, and MiniCPM-V 4.0 (August 2025) packs vision-language into just 4B parameters while surpassing GPT-4.1-mini on OpenCompass image understanding.
SmolVLM[31] from Hugging Face delivers vision-language in 256M parameters using under 1 GB GPU memory -- outperforming the 300x larger Idefics-80B. SmolVLM2 (2025) added video understanding in 256M, 500M, and 2.2B sizes. Moondream 0.5B runs as a 479 MiB download with under 1 GB runtime memory. Microsoft's Phi-4-multimodal (5.6B) handles speech, vision, and text through a mixture-of-LoRAs architecture -- demonstrated running on iPhone 12 Pro.
Meituan's MobileVLM V2[32] achieves 21.5 tok/s on Snapdragon 888 CPU -- proving multimodal inference works even on 2021-era hardware. On-device speech processing matured through WhisperKit (Swift/Core ML for iOS, QNN for Android), with Whisper Large V3 Turbo running entirely on-device for 100+ language transcription.
Inference frameworks
The established leaders
ExecuTorch (Meta) reached 1.0 GA in October 2025 and v1.1.0 in January 2026, emerging as the production standard. Its 50 KB base runtime -- the smallest of any framework -- supports 12+ hardware backends including Apple Core ML, Qualcomm QNN/Hexagon NPU, Arm XNNPACK with KleidiAI, MediaTek, Samsung Exynos, Vulkan, and NXP. It powers on-device AI across Instagram, WhatsApp, Messenger, Quest 3, and Ray-Ban Smart Glasses (billions of users). Native PyTorch export (torch.export() to .pte files) eliminates ONNX or TFLite conversion. New capabilities include multimodal LLM APIs, LoRA inference, 4-bit HQQ quantization, and experimental WASM/JavaScript support. Over 80% of popular edge LLMs on HuggingFace work out of the box.
llama.cpp remains the de facto standard for community-driven inference, with ~91K GitHub stars and multiple weekly releases. Its CPU-first philosophy and GGUF single-file format make it the most portable option. It offers the most extensive quantization support (Q2_K through Q8_0, plus importance-based IQ quantization), speculative decoding (180+ tok/s with 1B draft + 8B target on MacBook M1), and Android NDK support. Key improvements through early 2026 include faster FlashAttention for GQA, vendor-tuned OpenCL kernels for Qualcomm Adreno GPUs, and Vulkan backend unification.
MLC LLM, built on Apache TVM, takes a GPU-first approach with backends for Metal, Vulkan, OpenCL, and WebGPU. It provides OpenAI-compatible APIs and implements paged KV caching for superior long-context handling. However, non-Apple mobile GPU performance is disappointing -- only 5-20% ALU utilization on Mali/Adreno. Its WebLLM variant enables browser-based inference via WebGPU.
MNN-LLM (Alibaba) delivers the fastest mobile inference in benchmarks: 8.6x prefill speedup over llama.cpp on CPU and 25.3x on GPU (tested on Xiaomi 14). It uses automatic region fusion and DRAM-Flash hybrid storage that spills KV cache to Flash memory when DRAM fills up -- enabling long-context inference on memory-constrained devices.
Vendor-specific and new entrants
MediaPipe LLM Inference API (Google) integrates with the Google AI Edge ecosystem, supporting Gemma 3n, Gemma 3, and Gemma 2 models with multimodal prompting. Apple Core ML and MLX serve the Apple ecosystem, with Core ML targeting the Neural Engine with INT4 quantization support and MLX achieving the highest throughput on Apple Silicon (~230 tok/s on M2 Ultra, up to 525 tok/s on M4 Max).
Cactus v1 (December 2025), from a Y Combinator startup, delivers cross-platform inference with sub-50ms time-to-first-token, INT8 quantization with NPU acceleration, and benchmarks of 173 tok/s on Mac M4 Pro, 136 tok/s on iPhone 17 Pro, 91 tok/s on Galaxy S25 Ultra. MLLM v2 (November 2025) introduced full-graph NPU execution via Qualcomm's QNN backend, enabling entire transformer models to run on NPU without CPU/GPU fallback.
Apple's Foundation Models framework (WWDC June 2025) gives developers free access to the on-device Apple Intelligence model with guided generation, tool calling, and structured output -- but exclusively Apple's own model on Apple devices.
Benchmarks
Performance varies enormously depending on model size, quantization, hardware, and framework. The following data synthesizes benchmarks from llama.cpp, the "Understanding LLMs in Your Pockets" study[33], ExecuTorch documentation, PowerInfer-2, and vendor announcements.
Decode speed by model size and device (early 2025 baselines)
| Model | Quant | iPhone 16 (A18) | iPhone 15 Pro (A17) | Galaxy S24+ (SD 8 Gen 3) | Pixel 9 (Tensor G4) |
|---|---|---|---|---|---|
| TinyLlama 1.1B | Q4_0 | 70 tok/s | 57 tok/s | ~40 tok/s | -- |
| Llama 3.2 1B | INT4 | -- | -- | 40+ tok/s | -- |
| Llama 3.2 3B | Q4 | ~25-30 tok/s | ~20 tok/s | ~15-20 tok/s | -- |
| Gemini Nano ~3.5B | Proprietary | -- | -- | -- | 45 tok/s |
| Llama-2 7B | Q4 (CPU) | -- | -- | ~10 tok/s | -- |
| Llama-2 7B | Q4 (NPU) | -- | -- | ~20 tok/s | -- |
New records on 2025-2026 hardware
| Benchmark | Result | Hardware/framework |
|---|---|---|
| NPU decode | 220 tok/s | Snapdragon 8 Elite Gen 5 |
| NPU prefill (FastVLM) | 11,000+ tok/s | SD 8 Elite Gen 5 + LiteRT QNN |
| Gemma 3 1B prefill | 2,585 tok/s | Mobile GPU |
| Sub-1B model decode | 136 tok/s | iPhone 17 Pro (Cactus SDK) |
| Sub-1B model decode | 91 tok/s | Galaxy S25 Ultra (Cactus SDK) |
| MNN-LLM prefill | 25.3x faster than llama.cpp | Xiaomi 14 GPU |
| OPPO ColorOS 16 | 300 tok/s (claimed peak) | On-device |
Prefill speeds are substantially faster than decode: ExecuTorch achieves 350+ tok/s prefill for Llama 3.2 1B INT4 on the Galaxy S24+, enabling a 600-token message to be summarized with under 2 seconds time-to-first-token. NPU acceleration provides 10-50x prefill speedup over CPU on the same SoC.
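These prefill numbers translate directly into user-visible latency:

```python
def time_to_first_token(prompt_tokens, prefill_tok_s):
    # TTFT is dominated by prefill: the whole prompt must be processed
    # before the first output token can appear.
    return prompt_tokens / prefill_tok_s

# A 600-token message at 350 tok/s prefill (Llama 3.2 1B INT4, ExecuTorch):
ttft = time_to_first_token(600, 350)
print(f"{ttft:.2f} s")  # ~1.7 s, under the 2-second figure cited above
```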
Apple devices significantly outperform Android in GPU-accelerated workloads. Apple's M1 GPU delivers approximately 7x faster prefill than the Adreno 750 in the Snapdragon 8 Gen 3. On Android, CPU inference via llama.cpp often beats GPU inference due to low ALU utilization on Mali and Adreno GPUs.
For the sweet spot of 3B-8B models at 4-bit quantization, typical decode speeds on flagship 2025-2026 hardware range from 20-50 tokens/second -- comfortably above the ~15 tok/s threshold for fluid conversational interaction.
On-device training: from research to working prototypes
The most surprising development of this period was the emergence of actual on-device fine-tuning frameworks.
MobileFineTuner[34] (December 2025, accepted at MobiSys '26) is the first unified open-source C++ framework enabling end-to-end LLM fine-tuning directly on commodity mobile phones, supporting both full-parameter and LoRA fine-tuning of GPT-2, Gemma 3, and Qwen 2.5. It uses ZeRO-inspired parameter sharding, gradient accumulation, and energy-aware scheduling to work within phones' 4-16GB RAM constraint.
QVAC Fabric LLM[35] (December 2025) achieved the first LoRA fine-tuning on smartphone GPUs -- Qualcomm Adreno, ARM Mali, and Apple GPUs -- validating email style transfer and biomedical QA tasks on phone hardware. Built on the llama.cpp ecosystem, it works cross-platform.
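LoRA is what makes on-device fine-tuning plausible at all: instead of updating a d x d weight matrix, training touches only two small rank-r factors. A minimal NumPy sketch of the forward path (illustrative dimensions, not any framework's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8

w = rng.normal(size=(d, d)) / np.sqrt(d)  # frozen base weight
a = rng.normal(size=(r, d)) * 0.01        # trainable down-projection
b = np.zeros((d, r))                      # trainable up-projection, zero init

def lora_forward(x, alpha=16.0):
    # Output = base path + scaled low-rank update: W x + (alpha/r) B A x.
    return w @ x + (alpha / r) * (b @ (a @ x))

x = rng.normal(size=d)
# With B zero-initialized, the adapted model starts identical to the base.
assert np.allclose(lora_forward(x), w @ x)

trainable, frozen = a.size + b.size, w.size
print(f"trainable: {trainable} vs frozen: {frozen} parameters "
      f"({frozen // trainable}x fewer to update and transmit)")
```

The same arithmetic explains why federated approaches ship LoRA deltas rather than full models: the update is tens of kilobytes instead of gigabytes.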
Server-assisted approaches offer practical compromises. PAE MobiLLM (July 2025) achieves 13.1x reduction in on-device FLOPs through a one-time forward-pass design with activation caching. Fed MobiLLM[36] (August 2025) brings federated fine-tuning to mobile LLMs with 5.1x faster convergence than prior federated methods, enabling distributed model improvement across devices without centralizing user data.
These remain research prototypes, but they demonstrate that personalized on-device models -- where phones learn user preferences, writing style, and domain vocabulary without data leaving the device -- are technically feasible.
Commercial deployments across every major OEM
Apple
iOS 26 ships the Foundation Models framework, expanding on-device capabilities to system-wide Writing Tools, Call Screening with real-time transcription, Hold Assist, and Visual Intelligence on any screen content. An LLM-backed Siri is expected in iOS 26.4 (early 2026). Apple announced a partnership with Google in January 2026 to integrate Gemini models via Private Cloud Compute for tasks beyond the on-device model's capabilities.
Google
The Pixel 10 series (August 2025) debuted the most AI-forward Android experience. Magic Cue proactively connects information across Gmail, Calendar, Screenshots, and Messages -- powered entirely on-device by Tensor G5 and Gemini Nano for privacy. Voice Translate performs real-time phone call translation with voice cloning in 10 languages, all on-device. Developer access expanded through ML Kit GenAI APIs and AI Edge SDK.
Samsung
Galaxy AI reached 400+ million devices globally, with approximately two-thirds of users engaging regularly. The Galaxy S26 series (launching February 25, 2026) introduces EdgeFusion -- reportedly capable of generating images from text prompts in ~1 second entirely on-device -- alongside a multimodal AI camera accepting natural language editing instructions.
Chinese OEMs
OPPO's ColorOS 16 claims 300 tokens/second on-device and introduced an AI Agent Matrix for cross-device task planning, targeting 100 million generative AI users. OPPO developed a 7B on-device LLM with the AndesVL mobile VLM family (0.6B-4B).
Huawei launched HarmonyOS 6 (beta June 2025) with an AI Agent Framework enabling 50+ prebuilt agents, pivoting from app-centric to agent-centric UX powered by its Pangu 5.5 models (up to 718B in the cloud, open-sourced 7B/72B variants).
Vivo's BlueLM-3B is among the first device-cloud hybrid LLMs, Xiaomi has MiLM (up to 13B), and Honor's Magic8 Pro ships with proprietary low-bit quantization.
Where Octomil fits in this landscape
The on-device training developments -- MobileFineTuner, QVAC Fabric, and especially Fed MobiLLM's federated fine-tuning -- point directly at the problem Octomil solves. These research prototypes demonstrate that phones can fine-tune models locally, but they leave the hard coordination problem unsolved: how do you orchestrate training across thousands of heterogeneous devices, aggregate updates without centralizing data, manage model versions, and roll out improvements safely?
That is exactly what Octomil's platform handles:
- Federated orchestration coordinates on-device training across device fleets, managing round selection, update aggregation (9 strategies including FedAvg, FedProx, Krum, and SCAFFOLD), and convergence monitoring -- the missing layer between "a phone can fine-tune" and "a fleet learns together."
- Model compression pipeline (PyTorch to ONNX to CoreML/TFLite) maps directly to the quantization and framework landscape described above, with automated conversion for iOS and Android deployment.
- Progressive rollouts let teams deploy updated models to 5% of devices, validate metrics, and scale up -- the same canary deployment pattern that Google and Apple use internally but packaged as a product.
- Privacy-preserving training with differential privacy, secure aggregation, and gradient clipping ensures that on-device personalization stays private -- critical as regulations tighten around on-device data processing.
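At its core, the aggregation step described above is a sample-weighted average of client updates. A minimal FedAvg sketch (toy vectors standing in for model deltas; production aggregation layers secure aggregation and robust strategies like Krum on top of this):

```python
import numpy as np

def fedavg(updates, sample_counts):
    # FedAvg: average client updates weighted by local dataset size,
    # so clients with more data pull the global model harder.
    total = sum(sample_counts)
    return sum(n * u for u, n in zip(updates, sample_counts)) / total

# Three devices report local fine-tuning deltas (toy 2-parameter "models").
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 0.0])]
counts = [10, 30, 60]

global_delta = fedavg(updates, counts)
print(global_delta)  # [4.  1.4]
```

Variants like FedProx and SCAFFOLD change how client updates are computed, not this aggregation shape, which is why a platform can offer them as interchangeable strategies.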
To learn more: Advanced FL Configuration | Federated LLMs Guide | Advanced FL Concepts
Conclusion
The period from early 2025 through February 2026 marked the transition of on-device LLM inference from technical demonstration to mainstream product feature. Three developments stand out as genuinely transformative.
First, mobile NPUs crossing 100 TOPS (Snapdragon 8 Elite Gen 5, Dimensity 9500) eliminated compute as the primary bottleneck. Memory bandwidth is now the binding constraint, and LPDDR6 arriving in mid-2026 will be the next unlock.
Second, Gemma 3n's MatFormer architecture and Per-Layer Embeddings established a new design paradigm for mobile models -- elastic inference that adapts compute to the task, with parameters spilling to storage rather than occupying precious RAM. This, combined with ParetoQ's finding that 2-bit models can outperform 4-bit models at half the size, suggests the quality-per-byte curve still has significant room to improve.
Third, the emergence of working on-device fine-tuning (MobileFineTuner, QVAC Fabric, Fed MobiLLM) opens the door to personalized models that learn from user behavior without cloud dependency. When combined with federated learning orchestration, this enables a future where device fleets collectively improve models while keeping all training data on-device -- the privacy-preserving AI future that the industry has promised.
The framework war is consolidating around ExecuTorch (production mobile), llama.cpp (prototyping and community), and vendor-specific stacks. BitNet's promise of 1-bit inference has stalled at 2B scale -- practically useful but not yet transformative. The real surprise is multimodal: models like Gemma 3n and MiniCPM-o 4.5 can now process text, images, audio, and video simultaneously on a phone, a capability that was cloud-only just a year ago.
The edge AI market, projected to grow from $15.2B (2022) to $143.6B (2032), will drive continued rapid progress across all these fronts. The question is no longer whether LLMs can run on phones -- they already do, for hundreds of millions of users. The question is what becomes possible when every phone can not just run models, but learn from its own data and contribute to collective intelligence without compromising privacy.