2 posts tagged with "model-compression"

On-Device LLM Inference: The Definitive 2025-2026 Guide

· 30 min read

In under two years, on-device language models went from research curiosity to mainstream product feature. Smartphones can now run language models with up to 47 billion parameters. Flagship NPUs have crossed the 100 TOPS threshold. Multimodal models process text, images, audio, and video without cloud connectivity. And the first frameworks enabling actual on-device fine-tuning have arrived.

This guide covers the full arc from early 2025 through February 2026: optimization techniques, hardware capabilities, model releases, inference frameworks, performance benchmarks, commercial deployments, and the emerging frontier of on-device training. It is the most comprehensive single resource on mobile LLM inference available today.

Model Compression for Edge Devices: Making LLMs Run on Smartphones

· 10 min read

The irony of modern federated learning: We want to train sophisticated models on edge devices, but those same devices often can't even run the models.

A state-of-the-art language model has 7B+ parameters (~28 GB at 32-bit). An iPhone 15 Pro has 8 GB RAM. This math doesn't work.
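The arithmetic is worth spelling out. A minimal back-of-the-envelope sketch (weights only, ignoring activations and KV cache; the helper name is illustrative, not from any library):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-storage footprint in GB (weights only)."""
    return num_params * bits_per_param / 8 / 1e9

params = 7e9  # a typical 7B-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_memory_gb(params, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

Even at 4-bit quantization, the weights alone consume nearly half of an iPhone 15 Pro's 8 GB of RAM, which is shared with the OS and other apps.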

Model compression techniques—quantization, pruning, low-rank adaptation—are not just optimizations; they're prerequisites for production FL on edge devices. This post explores cutting-edge compression methods and how Octomil makes them accessible.