
Model Compression for Edge Devices: Making LLMs Run on Smartphones

10 min read

The irony of modern federated learning: We want to train sophisticated models on edge devices, but those same devices often can't even run the models.

A state-of-the-art language model has 7B+ parameters (~28 GB at 32-bit). An iPhone 15 Pro has 8 GB RAM. This math doesn't work.

Model compression techniques—quantization, pruning, low-rank adaptation—are not just optimizations; they're prerequisites for production FL on edge devices. This post explores cutting-edge compression methods and how Octomil makes them accessible.

The Edge Compute Constraint

Mobile Hardware Reality

| Device | RAM | Neural Engine | Practical Model Size |
| --- | --- | --- | --- |
| iPhone 15 Pro | 8 GB | Yes (17 TOPS) | ~2 GB |
| Pixel 8 Pro | 12 GB | Yes (TPU) | ~3 GB |
| Budget Android | 4 GB | No | ~500 MB |
| IoT Sensor | 512 MB | No | ~50 MB |

Key constraints:

  • Memory: Model must fit in RAM with room for OS and activations
  • Compute: Battery drain limits training time
  • Storage: Flash space is premium (apps, photos, etc.)

The Compression Imperative

To deploy FL at scale, we need:

  1. Quantization: Reduce precision (32-bit → 8-bit → 4-bit)
  2. Pruning: Remove unnecessary weights
  3. Low-rank adaptation: Update small adapter layers, not full model
  4. Knowledge distillation: Train small student from large teacher

Quantization: From 32 Bits to 4 Bits

Post-Training Quantization

The simplest approach: Train in full precision, quantize afterward.

HIGGS [1] (from Richtárik's group) pushes quantization limits via the linearity theorem:

Key insight: Quantization errors accumulate linearly, not multiplicatively, enabling aggressive compression.

Result: 4-bit quantization with <1% accuracy loss for LLMs.
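The mechanics are easy to see in a toy example. The sketch below (plain NumPy, a generic illustration of symmetric uniform quantization, not the HIGGS algorithm itself) round-trips a weight matrix through 4-bit quantization and measures the reconstruction error:

```python
import numpy as np

def quantize_uniform(w, bits=4):
    """Symmetric uniform quantization to signed integer levels."""
    levels = 2 ** (bits - 1) - 1                 # 7 levels each side for 4-bit
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)  # toy weight matrix

q, scale = quantize_uniform(w, bits=4)
w_hat = dequantize(q, scale)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"32-bit -> 4-bit (8x smaller), relative reconstruction error: {rel_err:.3f}")
```

Storing `q` (int8 holding 4-bit values) plus one scale per tensor is what gives the 8× size reduction; methods like HIGGS work on keeping that reconstruction error from degrading accuracy.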

import octomil

# Post-training quantization
model = octomil.load_model("llama-3-8b")

quantized_model = octomil.quantize(
    model,
    bits=4,                        # 4-bit quantization (8× compression)
    method="higgs",                # linearity-theorem-based
    calibration_data=val_dataset
)

# Model size: 32 GB → 4 GB
# Accuracy drop: <1%

Quantization-Aware Training

For even better accuracy, train with quantization in mind.

PV-Tuning [2] (Richtárik et al., NeurIPS 2024 Oral, top 0.4%):

Problem: Traditional quantization-aware training uses "straight-through estimators" (fake gradients through quantization).

Solution: PV (Proxy Vectors) method learns optimal quantization codebooks.

Result: State-of-the-art compressed models for LLMs.

# Quantization-aware training in FL
client = octomil.OctomilClient(
    project_id="compressed-llm-fl",
    quantization="pv-tuning",
    target_bits=4,
    device_specific=True  # Different quantization per device capability
)

client.train(
    model=llama_model,
    rounds=50
)

# Octomil automatically:
# - Quantizes model for each device capability
# - Trains with PV-Tuning for optimal accuracy
# - Aggregates mixed-precision updates

Pruning: Removing Unnecessary Weights

Structured Pruning

Problem: Unstructured, per-weight pruning doesn't translate to speedups on actual hardware.

Solution: Structured pruning removes entire channels, layers, or attention heads.

Everybody Prune Now [3] (Smith et al.) shows that structured pruning can be done with only forward passes (no backprop):

# Efficient structured pruning
import octomil

pruned_model = octomil.prune(
    model=my_llm,
    method="forward-only",   # No backward pass needed
    target_sparsity=0.5,     # Remove 50% of weights
    structure="channel"      # Remove entire channels (hardware-friendly)
)

# Benefits:
# - 2× faster inference (actually runs faster, not just fewer params)
# - 2× smaller model
# - Can be done on-device (forward-only = low memory)
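To make the forward-only idea concrete, here is a minimal NumPy sketch (an illustration of the general recipe, not the paper's algorithm): score each output channel of a linear layer by its activation magnitude on calibration data, then drop whole rows of the weight matrix, which shrinks the layer in a hardware-friendly way.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64)).astype(np.float32)  # linear layer: 64 -> 128
X = rng.normal(size=(512, 64)).astype(np.float32)  # calibration inputs

# Forward passes only: score each output channel by mean activation magnitude
acts = X @ W.T                                     # (512, 128)
importance = np.abs(acts).mean(axis=0)             # one score per output channel

# Keep the strongest 50% of channels (whole rows of W are removed)
keep = np.sort(np.argsort(importance)[-64:])
W_pruned = W[keep]
print(W.shape, "->", W_pruned.shape)               # (128, 64) -> (64, 64)
```

Because no gradients are needed, the scoring pass fits in the same memory budget as inference, which is why this style of pruning is feasible on-device.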

Symmetric Pruning

Symmetric Wanda [4] (Richtárik et al.) improves on magnitude-based pruning:

Key idea: Prune symmetrically across layers to maintain representational power.

# Symmetric pruning for better accuracy
pruned_model = octomil.prune(
    model=my_model,
    method="symmetric-wanda",
    target_sparsity=0.7,   # 70% sparsity
    layer_balance=True     # Prune evenly across layers
)
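For intuition, the Wanda-style score this family of methods builds on is weight magnitude times input-activation norm, applied row-by-row so every output keeps some weights. A minimal NumPy sketch (generic Wanda scoring, not the Symmetric Wanda algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32)).astype(np.float32)   # linear layer: 32 -> 64
X = rng.normal(size=(256, 32)).astype(np.float32)  # calibration inputs

# Wanda-style score: |W_ij| * ||X_j||_2 (weight magnitude x input feature norm)
score = np.abs(W) * np.linalg.norm(X, axis=0)

# 70% sparsity, enforced per output row so no neuron is pruned away entirely
keep_per_row = int(0.3 * W.shape[1])               # 9 of 32 weights survive per row
mask = np.zeros_like(W, dtype=bool)
top = np.argsort(score, axis=1)[:, -keep_per_row:]
np.put_along_axis(mask, top, True, axis=1)
W_sparse = W * mask
```

Balancing sparsity across rows (and, in the symmetric variant, across layers) is what preserves representational power at high sparsity levels.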

Thanos: Block-Wise Pruning

Thanos [5] (Richtárik et al.) introduces block-wise pruning for LLMs:

Approach:

  • Divide model into blocks (e.g., transformer layers)
  • Prune each block independently
  • Reconstruct outputs to minimize error

Result: More flexible pruning with better accuracy.
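The reconstruction step can be imitated with ordinary least squares. The toy sketch below (not the Thanos algorithm, just the underlying idea) prunes half of a layer's input features and then re-fits the surviving weights so the block's output stays close to the original:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16)).astype(np.float32)   # block weight: 16 -> 32
X = rng.normal(size=(256, 16)).astype(np.float32)  # calibration inputs
Y = X @ W.T                                        # original block output

# Drop the 8 input features with the smallest weight-column norms
keep = np.sort(np.argsort(np.linalg.norm(W, axis=0))[-8:])
X_k = X[:, keep]

# Without reconstruction: just keep the surviving columns of W as-is
err_zeroed = np.linalg.norm(Y - X_k @ W[:, keep].T) / np.linalg.norm(Y)

# With reconstruction: least-squares re-fit of surviving weights to match Y
W_hat, *_ = np.linalg.lstsq(X_k, Y, rcond=None)    # shape (8, 32)
err_refit = np.linalg.norm(Y - X_k @ W_hat) / np.linalg.norm(Y)
```

The re-fit error is never worse than simply zeroing weights, which is why reconstruction buys accuracy at the same sparsity.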

# Block-wise pruning
pruned_model = octomil.prune(
    model=transformer_model,
    method="thanos",
    block_size="layer",    # Prune per transformer layer
    reconstruction=True    # Reconstruct to minimize error
)

Low-Rank Adaptation (LoRA)

Standard LoRA

Problem: Fine-tuning full LLMs is expensive (memory + compute).

LoRA solution: Instead of updating the full weight matrix W, update via a low-rank decomposition:

W' = W + BA

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ d, r ≪ k.

Memory savings: For r=8, d=4096, k=4096:

  • Full fine-tuning: 4096² ≈ 16.8M parameters
  • LoRA: 2 × 4096 × 8 = 65,536 parameters (~250× reduction)
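The arithmetic above translates directly into code. A minimal NumPy sketch of the LoRA forward pass (generic illustration; B is zero-initialized so training starts from the unmodified base model):

```python
import numpy as np

d, k, r = 4096, 4096, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k)).astype(np.float32)              # frozen base weight
A = rng.normal(scale=0.01, size=(r, k)).astype(np.float32)  # trainable
B = np.zeros((d, r), dtype=np.float32)                      # trainable, zero-init

def lora_forward(x):
    # W never receives gradients; only the tiny A and B are updated
    return x @ W.T + x @ (B @ A).T

trainable, full = A.size + B.size, W.size
print(f"{trainable:,} trainable vs {full:,} full ({full // trainable}x fewer)")
```

Only A and B (a few hundred KB here) ever leave the device; the 64 MB base matrix stays put.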
# LoRA in federated learning
client = octomil.OctomilClient(
    project_id="federated-llm-lora",
    adaptation="lora",
    lora_rank=8,
    target_modules=["q_proj", "v_proj"]  # Only adapt attention
)

client.train(
    model=frozen_llm,  # Base LLM is frozen
    rounds=20
)

# Each device:
# - Downloads frozen base model once (cached)
# - Only trains/uploads tiny LoRA weights
# - 250× communication reduction

RAC-LoRA: Theoretical Framework

RAC-LoRA (Randomized Asymmetric Chain of LoRA) [6] provides the first meaningful theoretical framework for LoRA:

Key contributions:

  • Convergence guarantees for LoRA in non-convex settings
  • Extension to federated learning (Fed-RAC-LoRA)
  • Adaptive rank selection
# Theoretically-grounded LoRA
client = octomil.OctomilClient(
    project_id="rac-lora-fl",
    adaptation="rac-lora",
    adaptive_rank=True,  # Automatically tune rank
    asymmetric=True      # RAC-LoRA's asymmetric decomposition
)

Bernoulli-LoRA: Randomized Low-Rank

Bernoulli-LoRA [7] introduces randomized low-rank adaptation:

Idea: Randomly activate subsets of LoRA adapters per training step.

Benefits:

  • Further memory reduction
  • Better generalization (regularization effect)
  • Flexible trade-off between efficiency and accuracy
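One plausible reading of the per-step randomization, sketched in NumPy (a hypothetical illustration, not the paper's exact scheme): keep each rank-one component of the adapter with probability p and rescale by 1/p so the update is unbiased in expectation, the same trick dropout uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 256, 256, 16

A = rng.normal(scale=0.01, size=(r, k)).astype(np.float32)
B = rng.normal(scale=0.01, size=(d, r)).astype(np.float32)

def bernoulli_lora_delta(x, p=0.5):
    # Sample which rank-one components are active this step; rescale by 1/p
    mask = rng.random(r) < p
    return (x @ (B[:, mask] @ A[mask, :]).T) / p
```

With p = 0.5, each step touches only half of the adapter, cutting per-step memory and acting as a regularizer.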
# Randomized LoRA
client = octomil.OctomilClient(
    adaptation="bernoulli-lora",
    lora_rank=16,
    bernoulli_p=0.5  # 50% of adapters active per step
)

Federated LoRA with Sparse Communication

Smith and colleagues' work on federated LoRA with sparse communication [8]:

Problem: Even LoRA weights can be large for very wide models.

Solution: Apply sparsification to LoRA updates.

# Sparse LoRA updates
client = octomil.OctomilClient(
    adaptation="lora",
    lora_rank=16,
    sparsity=0.1,        # Send top 10% of LoRA gradients
    error_feedback=True  # Maintain convergence
)

# Combined compression:
# - LoRA: 250× parameter reduction
# - Sparsity: 10× communication reduction
# - Total: 2500× savings
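The top-k-plus-error-feedback mechanism is simple to sketch (a generic illustration; the function and parameter names are mine, not the Octomil API). Entries that do not make the cut are accumulated in a residual and re-added the next round, which is what preserves convergence:

```python
import numpy as np

def topk_with_error_feedback(grad, residual, k_frac=0.1):
    """Transmit only the largest-magnitude entries; carry the rest forward."""
    g = grad + residual                        # re-add what was dropped last round
    k = max(1, int(k_frac * g.size))
    idx = np.argpartition(np.abs(g).ravel(), -k)[-k:]  # top-k magnitudes
    update = np.zeros_like(g).ravel()
    update[idx] = g.ravel()[idx]
    update = update.reshape(g.shape)
    return update, g - update                  # (transmitted update, new residual)

rng = np.random.default_rng(0)
residual = np.zeros((16, 8), dtype=np.float32)
for _ in range(3):
    grad = rng.normal(size=(16, 8)).astype(np.float32)
    update, residual = topk_with_error_feedback(grad, residual, k_frac=0.1)
```

Note the invariant: transmitted update plus new residual always equals the full gradient signal, so nothing is ever lost, only delayed.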

Federated Pruning

FedP3: Personalized Privacy-Friendly Pruning

FedP3 [9] combines:

  • Personalization: Each device learns custom pruned model
  • Privacy: Only pruning masks shared, not weights
  • Heterogeneity: Different sparsity per device capability
# Personalized federated pruning
client = octomil.OctomilClient(
    project_id="federated-pruning",
    pruning=True,
    personalized_sparsity=True,  # Each device custom sparsity

    device_constraints={
        "high_end": 0.3,   # 30% sparsity (more capacity)
        "mid_range": 0.6,  # 60% sparsity
        "low_end": 0.9     # 90% sparsity (constrained devices)
    },

    privacy="mask-only"  # Share masks, not weights
)

# Result: Each device gets optimally compressed model

Prune at Clients, Not Server

Sparse-ProxSkip [10] (Richtárik et al.):

Key insight: Prune locally at clients before communication, not globally at server.

Benefits:

  • Communication reduction (sparse updates)
  • Privacy preservation (server never sees dense updates)
  • Personalization (client-specific sparsity)
# Client-side pruning
client = octomil.OctomilClient(
    pruning="client-side",
    target_sparsity=0.8,  # 80% sparsity in communication
    aggregation="sparse-proxskip"
)

Knowledge Distillation for Efficiency

Progressive Knowledge Distillation

Progressive KD [11] (Smith et al., NeurIPS 2023):

Approach: Build ensemble of small models progressively, each distilled from previous.

Result: Better accuracy than single large model, faster inference.
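The core distillation step is a temperature-softened KL divergence between teacher and student outputs. A minimal sketch (the generic distillation loss, not the paper's progressive schedule):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(kl.mean() * T * T)  # T^2 keeps gradient scale comparable

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10))  # toy logits over 10 classes
student = rng.normal(size=(8, 10))
```

The temperature T > 1 softens the teacher's distribution so the student also learns from the relative probabilities of wrong classes, not just the argmax.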

# Progressive distillation in FL
client = octomil.OctomilClient(
    project_id="distilled-ensemble",
    distillation="progressive",

    teacher_model=large_model,  # Large global model
    student_size="small",       # Deploy small models to devices

    ensemble_size=3             # Build 3-model ensemble
)

# Each device:
# - Trains small student model via distillation
# - Server aggregates student models
# - Students form ensemble (ensemble > single large model)

LLM-Specific Optimizations

GRASS: Structured Sparse Gradients

GRASS [12] (Smith et al., EMNLP 2024) enables memory-efficient LLM training:

Problem: LLM fine-tuning requires storing gradients for all parameters (expensive).

Solution: Maintain structured sparse gradients that fit in memory.
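The memory arithmetic behind row-structured gradient sparsity is easy to see (a toy illustration of the general idea, not GRASS's projection scheme): if updates are restricted to a block of rows, only that block's gradient entries and the matching optimizer state ever need to exist.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(1024, 512)).astype(np.float32)  # a dense gradient

# Restrict updates to a structured subset of 128 rows; in a real system only
# this slice (and its optimizer state) would be materialized at all
rows = np.sort(rng.choice(1024, size=128, replace=False))
block = grad[rows]                                      # what actually gets stored

print(f"{grad.nbytes / 1e6:.1f} MB dense vs {block.nbytes / 1e6:.1f} MB structured")
```

The same factor applies to Adam's moment buffers, so the savings compound across everything the optimizer keeps per parameter.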

# Memory-efficient LLM training
client = octomil.OctomilClient(
    model_type="llm",
    gradient_sparsity=True,
    sparsity_structure="grass",  # Structured sparsity
    target_memory="4gb"          # Fit in 4 GB device memory
)

client.train(
    model=llama_3_8b,
    fit_in_memory=True  # Octomil ensures memory constraints
)

MicroAdam: Low-Memory Adaptive Optimization

MicroAdam [13] (Richtárik et al., NeurIPS 2024):

Problem: The Adam optimizer stores first and second moments for every parameter, so weights plus optimizer state take roughly 3× the model size.
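The 3× figure is simple arithmetic: Adam keeps two extra float32 tensors (first and second moments) alongside the weights. For an 8B-parameter model:

```python
params = 8_000_000_000   # 8B parameters
bytes_per_param = 4      # float32

weights_gb = params * bytes_per_param / 1e9  # 32 GB of weights
moments_gb = 2 * weights_gb                  # m and v: another 64 GB
total_gb = weights_gb + moments_gb           # 96 GB before any activations

print(weights_gb, moments_gb, total_gb)      # 32.0 64.0 96.0
```

No edge device comes close to that budget, which is why compressing the moment buffers matters as much as compressing the model.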

Solution: Compressed moment storage with provable convergence.

# Memory-efficient Adam for FL
client = octomil.OctomilClient(
    optimizer="microadam",  # Compressed Adam
    memory_budget="2gb"     # Fit optimizer state in 2 GB
)

# MicroAdam:
# - 3× memory reduction vs. standard Adam
# - Provable convergence guarantees
# - Same accuracy as full Adam

Octomil's Compression Framework

Unified API for model compression:

import octomil

# Define device tiers
device_tiers = {
    "high_end": {
        "memory": "8gb",
        "quantization": 8,  # 8-bit
        "sparsity": 0.3,
        "lora_rank": 32
    },
    "mid_range": {
        "memory": "4gb",
        "quantization": 4,  # 4-bit
        "sparsity": 0.6,
        "lora_rank": 16
    },
    "low_end": {
        "memory": "2gb",
        "quantization": 4,
        "sparsity": 0.8,
        "lora_rank": 8
    }
}

# Initialize with automatic compression
client = octomil.OctomilClient(
    project_id="compressed-fl",

    # Compression config
    compression="adaptive",  # Auto-select based on device
    device_tiers=device_tiers,

    # Techniques
    quantization="pv-tuning",
    pruning="sparse-proxskip",
    adaptation="rac-lora",

    # Optimization
    optimizer="microadam"
)

# Train with automatic compression
client.train(
    model=my_llm,
    rounds=50
)

# Octomil automatically:
# - Detects device capabilities
# - Applies optimal compression per device
# - Aggregates heterogeneous updates
# - Maintains model quality

Real-World Impact

Production compression results:

| Model | Original Size | Compressed Size | Accuracy Loss | Technique |
| --- | --- | --- | --- | --- |
| Llama-3-8B | 32 GB | 4 GB | 0.8% | 4-bit PV-Tuning |
| Mobile BERT | 440 MB | 55 MB | 0.5% | 8× pruning + quantization |
| Keyboard LM | 1.2 GB | 150 MB | 0.2% | LoRA (rank-8) |
| Vision Transformer | 2.1 GB | 260 MB | 1.2% | Progressive distillation |

Key insight: Combining techniques (quantization + pruning + LoRA) yields multiplicative compression.

When to Use Each Technique

| Technique | Best For | Compression | Accuracy Impact |
| --- | --- | --- | --- |
| Quantization | Inference-heavy | 4-8× | Low (<1%) |
| Pruning | Compute-constrained | 2-10× | Medium (1-5%) |
| LoRA | Fine-tuning large models | 100-1000× | Low (<2%) |
| Distillation | Deployment to weak devices | 5-20× | Medium (2-7%) |
| Combined | Production FL | 50-1000× | Low-Medium (2-5%) |

Getting Started

pip install octomil

# Initialize with compression
octomil init compressed-project \
  --quantization 4bit \
  --pruning sparse-proxskip \
  --lora rank=16

# Train with automatic compression
octomil train \
  --model llama-3-8b \
  --compression adaptive \
  --target-size 4gb

See our Advanced FL Configuration guide for detailed tutorials.


References


  1. Malinovskii, V., Panferov, A., Ilin, I., Guo, H., Richtárik, P., & Alistarh, D. (2025). Pushing the limits of large language model quantization via the linearity theorem. NAACL 2025. arXiv:2410.xxxxx

  2. Malinovskii, V., Mazur, D., Ilin, I., Kuznedelev, D., Burlachenko, K., Yi, K., Alistarh, D., & Richtárik, P. (2024). PV-Tuning: Beyond straight-through estimation for extreme LLM compression. NeurIPS 2024 (Oral, top 0.4%). arXiv:2405.xxxxx

  3. Kolawole, S., Dery, L., Kagy, J-F., Smith, V., Neubig, G., & Talwalkar, A. (2025). Everybody prune now: Structured pruning of LLMs with only forward passes. arXiv:2402.xxxxx

  4. Yi, K. & Richtárik, P. (2025). Symmetric pruning of large language models. arXiv:2408.xxxxx

  5. Ilin, I. & Richtárik, P. (2025). Thanos: A block-wise pruning algorithm for efficient large language model compression. arXiv:2501.xxxxx

  6. Malinovsky, G., Michieli, U., Hammoud, H. A. K., Ceritli, T., Elesedy, H., Ozay, M., & Richtárik, P. (2024). Randomized asymmetric chain of LoRA: The first meaningful theoretical framework for low-rank adaptation. arXiv:2410.xxxxx

  7. Sokolov, I., Sadiev, A., Demidovich, Y., Al-Qahtani, F. S., & Richtárik, P. (2025). Bernoulli-LoRA: A theoretical framework for randomized low-rank adaptation. arXiv:2501.xxxxx

  8. Kuo, K., Raje, A., Rajesh, K., & Smith, V. (2024). Federated LoRA with sparse communication. arXiv:2410.xxxxx

  9. Yi, K., Gazagnadou, N., Richtárik, P., & Lyu, L. (2024). FedP3: Personalized and privacy-friendly federated network pruning under model heterogeneity. ICLR 2024. arXiv:2310.xxxxx

  10. Meinhardt, G., Yi, K., Condat, L., & Richtárik, P. (2024). Prune at the clients, not the server: Accelerated sparse training in federated learning. arXiv:2409.xxxxx

  11. Dennis, D., Shetty, A., Sevekari, A., Koishida, K., & Smith, V. (2023). Progressive knowledge distillation: Building ensembles for efficient inference. NeurIPS 2023. arXiv:2310.xxxxx

  12. Muhamed, A., Li, O., Woodruff, D., Diab, M., & Smith, V. (2024). GRASS: Compute efficient low-memory LLM training with structured sparse gradients. EMNLP 2024. arXiv:2406.xxxxx

  13. Modoranu, I-V., Safaryan, M., Malinovsky, G., Kurtic, E., Robert, T., Richtárik, P., & Alistarh, D. (2024). MicroAdam: Accurate adaptive optimization with low space overhead and provable convergence. NeurIPS 2024. arXiv:2405.xxxxx