Model Compression for Edge Devices: Making LLMs Run on Smartphones
The irony of modern federated learning: We want to train sophisticated models on edge devices, but those same devices often can't even run the models.
A state-of-the-art language model has 7B+ parameters (~28 GB at 32-bit). An iPhone 15 Pro has 8 GB RAM. This math doesn't work.
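A quick back-of-the-envelope check makes the gap concrete: parameter count times bytes per parameter gives the raw weight footprint at each precision.

```python
# Raw weight memory for a 7B-parameter model at different precisions.
params = 7_000_000_000

def weight_gb(params, bits):
    """Bytes of weight storage at the given precision, in GB."""
    return params * bits / 8 / 1e9

fp32 = weight_gb(params, 32)   # 28.0 GB: exceeds any phone's RAM
int8 = weight_gb(params, 8)    # 7.0 GB: still too big for an 8 GB device
int4 = weight_gb(params, 4)    # 3.5 GB: plausible on a flagship phone
```

And this counts only weights; activations, KV cache, and the OS need room on top.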
Model compression techniques—quantization, pruning, low-rank adaptation—are not just optimizations; they're prerequisites for production FL on edge devices. This post explores cutting-edge compression methods and how Octomil makes them accessible.
The Edge Compute Constraint
Mobile Hardware Reality
| Device | RAM | Neural Engine | Practical Model Size |
|---|---|---|---|
| iPhone 15 Pro | 8 GB | Yes (17 TOPS) | ~2 GB |
| Pixel 8 Pro | 12 GB | Yes (TPU) | ~3 GB |
| Budget Android | 4 GB | No | ~500 MB |
| IoT Sensor | 512 MB | No | ~50 MB |
Key constraints:
- Memory: Model must fit in RAM with room for OS and activations
- Compute: Battery drain limits training time
- Storage: Flash space is premium (apps, photos, etc.)
The Compression Imperative
To deploy FL at scale, we need:
- Quantization: Reduce precision (32-bit → 8-bit → 4-bit)
- Pruning: Remove unnecessary weights
- Low-rank adaptation: Update small adapter layers, not full model
- Knowledge distillation: Train small student from large teacher
Quantization: From 32 Bits to 4 Bits
Post-Training Quantization
The simplest approach: Train in full precision, quantize afterward.
HIGGS [1] (from Richtárik's group) pushes quantization limits via the linearity theorem:
Key insight: The model's end-to-end error is, to first order, a linear function of per-layer quantization error rather than a multiplicative one, so each layer can be quantized aggressively and independently.
Result: 4-bit quantization with <1% accuracy loss for LLMs.
import octomil
# Post-training quantization
model = octomil.load_model("llama-3-8b")
quantized_model = octomil.quantize(
model,
bits=4, # 4-bit quantization (8× compression)
method="higgs", # Linearity-theorem-based
calibration_data=val_dataset
)
# Model size: 32 GB → 4 GB
# Accuracy drop: <1%
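Under the hood, the baseline that methods like HIGGS improve on is plain round-to-nearest quantization. A minimal NumPy sketch of that baseline (not HIGGS itself, which additionally exploits the linearity theorem):

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization with a per-row scale.

    Returns integer codes plus the scale; dequantize as codes * scale.
    """
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
codes, scale = quantize_rtn(w)
w_hat = codes * scale                                # dequantized weights
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
# Storage: 4 bits per code plus one fp32 scale per row (~8x smaller)
```

Calibration-based methods tune the scales (and codebooks) on real data instead of using the row maximum, which is where most of the accuracy recovery comes from.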
Quantization-Aware Training
For even better accuracy, train with quantization in mind.
PV-Tuning [2] (Richtárik et al., NeurIPS 2024 Oral, top 0.4% of submissions):
Problem: Traditional quantization-aware training relies on straight-through estimators, which approximate the gradient by passing it unchanged through the non-differentiable rounding step.
Solution: PV (Proxy Vectors) method learns optimal quantization codebooks.
Result: State-of-the-art compressed models for LLMs.
# Quantization-aware training in FL
client = octomil.OctomilClient(
project_id="compressed-llm-fl",
quantization="pv-tuning",
target_bits=4,
device_specific=True # Different quantization per device capability
)
client.train(
model=llama_model,
rounds=50
)
# Octomil automatically:
# - Quantizes model for each device capability
# - Trains with PV-Tuning for optimal accuracy
# - Aggregates mixed-precision updates
Pruning: Removing Unnecessary Weights
Structured Pruning
Problem: Random weight pruning doesn't translate to speedups on actual hardware.
Solution: Structured pruning removes entire channels, layers, or attention heads.
Everybody Prune Now [3] (Smith et al.) shows structured pruning can be done with only forward passes (no backprop):
# Efficient structured pruning
import octomil
pruned_model = octomil.prune(
model=my_llm,
method="forward-only", # No backward pass needed
target_sparsity=0.5, # Remove 50% of weights
structure="channel" # Remove entire channels (hardware-friendly)
)
# Benefits:
# - 2× faster inference (actually runs faster, not just fewer params)
# - 2× smaller model
# - Can be done on-device (forward-only = low memory)
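The forward-only idea can be sketched generically: score each output channel by its average activation magnitude on a calibration batch, then drop the weakest channels whole. This is an illustrative criterion, not the paper's exact method:

```python
import numpy as np

def score_channels(layer_w, calib_x):
    """Score output channels using only a forward pass: the average
    activation magnitude of each channel on calibration data."""
    acts = calib_x @ layer_w.T          # (batch, out_channels)
    return np.abs(acts).mean(axis=0)    # one score per output channel

def prune_channels(layer_w, calib_x, sparsity=0.5):
    """Remove the lowest-scoring output channels entirely (structured)."""
    scores = score_channels(layer_w, calib_x)
    n_keep = int(layer_w.shape[0] * (1 - sparsity))
    keep = np.sort(np.argsort(scores)[-n_keep:])   # indices of kept channels
    return layer_w[keep], keep

rng = np.random.default_rng(1)
w = rng.standard_normal((128, 64))       # layer with 128 output channels
x = rng.standard_normal((256, 64))       # calibration batch
w_pruned, kept = prune_channels(w, x, sparsity=0.5)
```

Because whole rows are removed, the pruned matrix is genuinely smaller and every downstream matmul genuinely faster, unlike unstructured zeroing.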
Symmetric Pruning
Symmetric Wanda [4] (Richtárik et al.) improves on magnitude-based pruning:
Key idea: Prune symmetrically across layers to maintain representational power.
# Symmetric pruning for better accuracy
pruned_model = octomil.prune(
model=my_model,
method="symmetric-wanda",
target_sparsity=0.7, # 70% sparsity
layer_balance=True # Prune evenly across layers
)
Thanos: Block-Wise Pruning
Thanos [5] (Richtárik et al.) introduces block-wise pruning for LLMs:
Approach:
- Divide model into blocks (e.g., transformer layers)
- Prune each block independently
- Reconstruct outputs to minimize error
Result: More flexible pruning with better accuracy.
# Block-wise pruning
pruned_model = octomil.prune(
model=transformer_model,
method="thanos",
block_size="layer", # Prune per transformer layer
reconstruction=True # Reconstruct to minimize error
)
Low-Rank Adaptation (LoRA)
Standard LoRA
Problem: Fine-tuning full LLMs is expensive (memory + compute).
LoRA solution: Instead of updating the full weight matrix W ∈ ℝ^(d×k), learn a low-rank update:

W′ = W + BA

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ d, r ≪ k.
Memory savings: For r=8, d=4096, k=4096:
- Full fine-tuning: 4096² = 16M parameters
- LoRA: 2 × 4096 × 8 = 65,536 parameters (~250× reduction)
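The parameter counts are easy to verify (the exact ratio is 256×; the post rounds it to 250×):

```python
# Trainable parameters: full fine-tuning vs. a rank-8 LoRA adapter.
d, k, r = 4096, 4096, 8

full = d * k              # every entry of the weight matrix
lora = d * r + r * k      # B (d x r) plus A (r x k)
reduction = full / lora   # how many times fewer trainable parameters
```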
# LoRA in federated learning
client = octomil.OctomilClient(
project_id="federated-llm-lora",
adaptation="lora",
lora_rank=8,
target_modules=["q_proj", "v_proj"] # Only adapt attention
)
client.train(
model=frozen_llm, # Base LLM is frozen
rounds=20
)
# Each device:
# - Downloads frozen base model once (cached)
# - Only trains/uploads tiny LoRA weights
# - 250× communication reduction
RAC-LoRA: Theoretical Framework
RAC-LoRA (Randomized Asymmetric Chain of LoRA) [6] provides a rigorous theoretical framework for LoRA:
Key contributions:
- Convergence guarantees for LoRA in non-convex settings
- Extension to federated learning (Fed-RAC-LoRA)
- Adaptive rank selection
# Theoretically-grounded LoRA
client = octomil.OctomilClient(
project_id="rac-lora-fl",
adaptation="rac-lora",
adaptive_rank=True, # Automatically tune rank
asymmetric=True # RAC-LoRA's asymmetric decomposition
)
Bernoulli-LoRA: Randomized Low-Rank
Bernoulli-LoRA [7] introduces randomized low-rank adaptation:
Idea: Randomly activate subsets of LoRA adapters per training step.
Benefits:
- Further memory reduction
- Better generalization (regularization effect)
- Flexible trade-off between efficiency and accuracy
# Randomized LoRA
client = octomil.OctomilClient(
adaptation="bernoulli-lora",
lora_rank=16,
bernoulli_p=0.5 # 50% of adapters active per step
)
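One way to picture randomized adapter activation is dropout-style gating of the low-rank updates. This toy sketch assumes a per-step Bernoulli mask with inverse-probability scaling so the update is unbiased in expectation; it is an illustration of the general idea, not Bernoulli-LoRA's exact scheme:

```python
import numpy as np

def lora_forward(x, W, adapters, p=0.5, rng=None):
    """Frozen base weights W plus randomly gated LoRA adapters.

    Each adapter (B, A) is kept this step with probability p and its
    contribution scaled by 1/p (dropout-style unbiasedness)."""
    rng = rng or np.random.default_rng()
    out = x @ W.T
    for B, A in adapters:
        if rng.random() < p:                 # Bernoulli gate per adapter
            out += (x @ A.T) @ B.T / p
    return out

rng = np.random.default_rng(2)
d, k, r = 32, 32, 4
W = rng.standard_normal((d, k))              # frozen base weights
adapters = [(rng.standard_normal((d, r)), rng.standard_normal((r, k)))
            for _ in range(4)]
x = rng.standard_normal((8, k))
y = lora_forward(x, W, adapters, p=0.5, rng=rng)
```

With p=1 every adapter fires and the layer reduces to ordinary multi-adapter LoRA; lowering p trades accuracy per step for memory and compute.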
Federated LoRA with Sparse Communication
Smith's work on federated LoRA with sparse communication [8] addresses a remaining bottleneck:
Problem: Even LoRA weights can be large for very wide models.
Solution: Apply sparsification to LoRA updates.
# Sparse LoRA updates
client = octomil.OctomilClient(
adaptation="lora",
lora_rank=16,
sparsity=0.1, # Send top 10% of LoRA gradients
error_feedback=True # Maintain convergence
)
# Combined compression:
# - LoRA: 250× parameter reduction
# - Sparsity: 10× communication reduction
# - Total: 2500× savings
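Top-k sparsification with error feedback is simple to state in code. A generic sketch (the `topk_with_feedback` helper is illustrative, not Octomil's API): keep only the largest-magnitude entries, and carry everything else forward in a local error buffer so nothing is permanently lost.

```python
import numpy as np

def topk_with_feedback(grad, error, k_frac=0.1):
    """Send only the largest-magnitude k% of entries; accumulate the
    rest in an error buffer that is added back next round."""
    corrected = grad + error                  # fold in last round's residual
    k = max(1, int(corrected.size * k_frac))
    flat = np.abs(corrected).ravel()
    kth = flat.size - k
    thresh = np.partition(flat, kth)[kth]     # k-th largest magnitude
    mask = np.abs(corrected) >= thresh
    sent = np.where(mask, corrected, 0.0)     # what goes over the wire
    new_error = corrected - sent              # what we failed to send
    return sent, new_error

rng = np.random.default_rng(3)
g = rng.standard_normal((100, 100))
sent, err = topk_with_feedback(g, np.zeros_like(g), k_frac=0.1)
```

The error buffer is what keeps convergence intact: over many rounds every coordinate eventually gets transmitted, just later.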
Federated Pruning
FedP3: Personalized Privacy-Friendly Pruning
FedP3 [9] combines:
- Personalization: Each device learns custom pruned model
- Privacy: Only pruning masks shared, not weights
- Heterogeneity: Different sparsity per device capability
# Personalized federated pruning
client = octomil.OctomilClient(
project_id="federated-pruning",
pruning=True,
personalized_sparsity=True, # Each device custom sparsity
device_constraints={
"high_end": 0.3, # 30% sparsity (more capacity)
"mid_range": 0.6, # 60% sparsity
"low_end": 0.9 # 90% sparsity (constrained devices)
},
privacy="mask-only" # Share masks, not weights
)
# Result: Each device gets optimally compressed model
Prune at Clients, Not Server
Sparse-ProxSkip [10] (Richtárik et al.):
Key insight: Prune locally at clients before communication, not globally at server.
Benefits:
- Communication reduction (sparse updates)
- Privacy preservation (server never sees dense updates)
- Personalization (client-specific sparsity)
# Client-side pruning
client = octomil.OctomilClient(
pruning="client-side",
target_sparsity=0.8, # 80% sparsity in communication
aggregation="sparse-proxskip"
)
Knowledge Distillation for Efficiency
Progressive Knowledge Distillation
Progressive KD [11] (Smith et al., NeurIPS 2023):
Approach: Build ensemble of small models progressively, each distilled from previous.
Result: Better accuracy than single large model, faster inference.
# Progressive distillation in FL
client = octomil.OctomilClient(
project_id="distilled-ensemble",
distillation="progressive",
teacher_model=large_model, # Large global model
student_size="small", # Deploy small models to devices
ensemble_size=3 # Build 3-model ensemble
)
# Each device:
# - Trains small student model via distillation
# - Server aggregates student models
# - Students form ensemble (ensemble > single large model)
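The distillation objective itself is standard: match the student's softened output distribution to the teacher's. A minimal NumPy sketch of the Hinton-style loss term (illustrative; the progressive variant layers this across an ensemble):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between temperature-softened teacher and student
    distributions; higher T exposes the teacher's 'dark knowledge'."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean()

rng = np.random.default_rng(4)
teacher = rng.standard_normal((16, 10))          # teacher logits
loss_self = distill_loss(teacher, teacher)       # student matches teacher
loss_rand = distill_loss(rng.standard_normal((16, 10)), teacher)
```

A perfectly matching student hits the floor of this loss (the teacher's entropy); any mismatch raises it, which is what makes it a useful training signal.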
LLM-Specific Optimizations
GRASS: Structured Sparse Gradients
GRASS [12] (Smith et al., EMNLP 2024) enables memory-efficient LLM training:
Problem: LLM fine-tuning requires storing gradients for all parameters (expensive).
Solution: Maintain structured sparse gradients that fit in memory.
# Memory-efficient LLM training
client = octomil.OctomilClient(
model_type="llm",
gradient_sparsity=True,
sparsity_structure="grass", # Structured sparsity
target_memory="4gb" # Fit in 4 GB device memory
)
client.train(
model=llama_3_8b,
fit_in_memory=True # Octomil ensures memory constraints
)
MicroAdam: Low-Memory Adaptive Optimization
MicroAdam [13] (Richtárik et al., NeurIPS 2024):
Problem: The Adam optimizer stores first- and second-moment buffers for every parameter, i.e. two extra full copies of the model (3× total memory before activations).
Solution: Compressed moment storage with provable convergence.
# Memory-efficient Adam for FL
client = octomil.OctomilClient(
optimizer="microadam", # Compressed Adam
memory_budget="2gb" # Fit optimizer state in 2 GB
)
# MicroAdam:
# - 3× memory reduction vs. standard Adam
# - Provable convergence guarantees
# - Same accuracy as full Adam
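To see why optimizer state is the bottleneck, count the buffers (assuming fp32 throughout; this shows the problem MicroAdam compresses away, not its exact savings):

```python
# Optimizer-state memory for a 7B model: Adam keeps two extra
# full-precision buffers (first and second moments) per parameter.
params = 7_000_000_000
bytes_per = 4                                # fp32

weights_gb = params * bytes_per / 1e9        # 28 GB of weights
adam_state_gb = 2 * weights_gb               # m and v buffers: 56 GB
total_gb = weights_gb + adam_state_gb        # 84 GB before activations
```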
Octomil's Compression Framework
Unified API for model compression:
import octomil
# Define device tiers
device_tiers = {
"high_end": {
"memory": "8gb",
"quantization": 8, # 8-bit
"sparsity": 0.3,
"lora_rank": 32
},
"mid_range": {
"memory": "4gb",
"quantization": 4, # 4-bit
"sparsity": 0.6,
"lora_rank": 16
},
"low_end": {
"memory": "2gb",
"quantization": 4,
"sparsity": 0.8,
"lora_rank": 8
}
}
# Initialize with automatic compression
client = octomil.OctomilClient(
project_id="compressed-fl",
# Compression config
compression="adaptive", # Auto-select based on device
device_tiers=device_tiers,
# Techniques
quantization="pv-tuning",
pruning="sparse-proxskip",
adaptation="rac-lora",
# Optimization
optimizer="microadam"
)
# Train with automatic compression
client.train(
model=my_llm,
rounds=50
)
# Octomil automatically:
# - Detects device capabilities
# - Applies optimal compression per device
# - Aggregates heterogeneous updates
# - Maintains model quality
Real-World Impact
Production compression results:
| Model | Original Size | Compressed Size | Accuracy Loss | Technique |
|---|---|---|---|---|
| Llama-3-8B | 32 GB | 4 GB | 0.8% | 4-bit PV-Tuning |
| Mobile BERT | 440 MB | 55 MB | 0.5% | 8× pruning + quantization |
| Keyboard LM | 1.2 GB | 150 MB | 0.2% | LoRA (rank-8) |
| Vision Transformer | 2.1 GB | 260 MB | 1.2% | Progressive distillation |
Key insight: Combining techniques (quantization + pruning + LoRA) yields multiplicative compression.
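The multiplication is literal, as long as the techniques act on different axes (bits per weight, number of weights, what gets communicated):

```python
# Compression ratios compose multiplicatively across independent axes.
quant = 8          # 32-bit -> 4-bit: fewer bits per weight
prune = 2          # 50% structured sparsity: fewer weights
combined_model = quant * prune            # 16x smaller on-device model

lora = 250         # trainable-parameter reduction (rank-8 adapters)
sparsify = 10      # top-10% gradient sparsification
combined_comm = lora * sparsify           # 2500x less upload traffic
```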
When to Use Each Technique
| Technique | Best For | Compression | Accuracy Impact |
|---|---|---|---|
| Quantization | Inference-heavy | 4-8× | Low (<1%) |
| Pruning | Compute-constrained | 2-10× | Medium (1-5%) |
| LoRA | Fine-tuning large models | 100-1000× | Low (<2%) |
| Distillation | Deployment to weak devices | 5-20× | Medium (2-7%) |
| Combined | Production FL | 50-1000× | Low-Medium (2-5%) |
Getting Started
pip install octomil
# Initialize with compression
octomil init compressed-project \
--quantization 4bit \
--pruning sparse-proxskip \
--lora rank=16
# Train with automatic compression
octomil train \
--model llama-3-8b \
--compression adaptive \
--target-size 4gb
See our Advanced FL Configuration guide for detailed tutorials.