Model Compression for Edge Devices: Making LLMs Run on Smartphones
The irony of modern federated learning: We want to train sophisticated models on edge devices, but those same devices often can't even run the models.
A state-of-the-art language model has 7B+ parameters (~28 GB at 32-bit). An iPhone 15 Pro has 8 GB RAM. This math doesn't work.
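A quick back-of-the-envelope check makes the gap concrete: parameter count times bytes per parameter gives the raw weight footprint at each precision.

```python
# Raw weight memory for a 7B-parameter model at different precisions.
params = 7_000_000_000

def weight_gb(params, bits):
    """Bytes of weight storage at the given precision, in GB."""
    return params * bits / 8 / 1e9

fp32 = weight_gb(params, 32)   # 28.0 GB: exceeds any phone's RAM
int8 = weight_gb(params, 8)    # 7.0 GB: still too big for an 8 GB device
int4 = weight_gb(params, 4)    # 3.5 GB: plausible on a flagship phone
```

And this counts only weights; activations, KV cache, and the OS need room on top.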
Model compression techniques—quantization, pruning, low-rank adaptation—are not just optimizations; they're prerequisites for production FL on edge devices. This post explores cutting-edge compression methods and how Octomil makes them accessible.
The Edge Compute Constraint
Mobile Hardware Reality
| Device | RAM | Neural Engine | Practical Model Size |
|---|---|---|---|
| iPhone 15 Pro | 8 GB | Yes (17 TOPS) | ~2 GB |
| Pixel 8 Pro | 12 GB | Yes (TPU) | ~3 GB |
| Budget Android | 4 GB | No | ~500 MB |
| IoT Sensor | 512 MB | No | ~50 MB |
Key constraints:
- Memory: Model must fit in RAM with room for OS and activations
- Compute: Battery drain limits training time
- Storage: Flash space is premium (apps, photos, etc.)
The Compression Imperative
To deploy FL at scale, we need:
- Quantization: Reduce precision (32-bit → 8-bit → 4-bit)
- Pruning: Remove unnecessary weights
- Low-rank adaptation: Update small adapter layers, not full model
- Knowledge distillation: Train small student from large teacher
Quantization: From 32 Bits to 4 Bits
Post-Training Quantization
The simplest approach: Train in full precision, quantize afterward.
HIGGS [1] (from Richtárik's group) pushes quantization limits via the linearity theorem:
Key insight: The model's end-to-end error is, to first order, a linear function of per-layer quantization error rather than a multiplicative one, so each layer can be quantized aggressively and independently.
Result: 4-bit quantization with <1% accuracy loss for LLMs.
import octomil
# Post-training quantization
model = octomil.load_model("llama-3-8b")
quantized_model = octomil.quantize(
model,
bits=4, # 4-bit quantization (8× compression)
method="higgs", # Linearity-theorem-based
calibration_data=val_dataset
)
# Model size: 32 GB → 4 GB
# Accuracy drop: <1%
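Under the hood, the baseline that methods like HIGGS improve on is plain round-to-nearest quantization. A minimal NumPy sketch of that baseline (not HIGGS itself, which additionally exploits the linearity theorem):

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization with a per-row scale.

    Returns integer codes plus the scale; dequantize as codes * scale.
    """
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
codes, scale = quantize_rtn(w)
w_hat = codes * scale                                # dequantized weights
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
# Storage: 4 bits per code plus one fp32 scale per row (~8x smaller)
```

Calibration-based methods tune the scales (and codebooks) on real data instead of using the row maximum, which is where most of the accuracy recovery comes from.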
Quantization-Aware Training
For even better accuracy, train with quantization in mind.
PV-Tuning [2] (Richtárik et al., NeurIPS 2024 Oral, top 0.4% of submissions):
Problem: Traditional quantization-aware training relies on straight-through estimators, which approximate the gradient by passing it unchanged through the non-differentiable rounding step.
Solution: PV (Proxy Vectors) method learns optimal quantization codebooks.
Result: State-of-the-art compressed models for LLMs.
# Quantization-aware training in FL
client = octomil.OctomilClient(
project_id="compressed-llm-fl",
quantization="pv-tuning",
target_bits=4,
device_specific=True # Different quantization per device capability
)
client.train(
model=llama_model,
rounds=50
)
# Octomil automatically:
# - Quantizes model for each device capability
# - Trains with PV-Tuning for optimal accuracy
# - Aggregates mixed-precision updates
Pruning: Removing Unnecessary Weights
Structured Pruning
Problem: Random weight pruning doesn't translate to speedups on actual hardware.
Solution: Structured pruning removes entire channels, layers, or attention heads.
Everybody Prune Now [3] (Smith et al.) shows structured pruning can be done with only forward passes (no backprop):
# Efficient structured pruning
import octomil
pruned_model = octomil.prune(
model=my_llm,
method="forward-only", # No backward pass needed
target_sparsity=0.5, # Remove 50% of weights
structure="channel" # Remove entire channels (hardware-friendly)
)
# Benefits:
# - 2× faster inference (actually runs faster, not just fewer params)
# - 2× smaller model
# - Can be done on-device (forward-only = low memory)
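The forward-only idea can be sketched generically: score each output channel by its average activation magnitude on a calibration batch, then drop the weakest channels whole. This is an illustrative criterion, not the paper's exact method:

```python
import numpy as np

def score_channels(layer_w, calib_x):
    """Score output channels using only a forward pass: the average
    activation magnitude of each channel on calibration data."""
    acts = calib_x @ layer_w.T          # (batch, out_channels)
    return np.abs(acts).mean(axis=0)    # one score per output channel

def prune_channels(layer_w, calib_x, sparsity=0.5):
    """Remove the lowest-scoring output channels entirely (structured)."""
    scores = score_channels(layer_w, calib_x)
    n_keep = int(layer_w.shape[0] * (1 - sparsity))
    keep = np.sort(np.argsort(scores)[-n_keep:])   # indices of kept channels
    return layer_w[keep], keep

rng = np.random.default_rng(1)
w = rng.standard_normal((128, 64))       # layer with 128 output channels
x = rng.standard_normal((256, 64))       # calibration batch
w_pruned, kept = prune_channels(w, x, sparsity=0.5)
```

Because whole rows are removed, the pruned matrix is genuinely smaller and every downstream matmul genuinely faster, unlike unstructured zeroing.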
Symmetric Pruning
Symmetric Wanda [4] (Richtárik et al.) improves on magnitude-based pruning:
Key idea: Prune symmetrically across layers to maintain representational power.
# Symmetric pruning for better accuracy
pruned_model = octomil.prune(
model=my_model,
method="symmetric-wanda",
target_sparsity=0.7, # 70% sparsity
layer_balance=True # Prune evenly across layers
)
Thanos: Block-Wise Pruning
Thanos [5] (Richtárik et al.) introduces block-wise pruning for LLMs:
Approach:
- Divide model into blocks (e.g., transformer layers)
- Prune each block independently
- Reconstruct outputs to minimize error
Result: More flexible pruning with better accuracy.
# Block-wise pruning
pruned_model = octomil.prune(
model=transformer_model,
method="thanos",
block_size="layer", # Prune per transformer layer
reconstruction=True # Reconstruct to minimize error
)
Low-Rank Adaptation (LoRA)
Standard LoRA
Problem: Fine-tuning full LLMs is expensive (memory + compute).
LoRA solution: Instead of updating the full weight matrix W ∈ ℝ^(d×k), learn a low-rank update:

W′ = W + BA

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ d, r ≪ k.
Memory savings: For r=8, d=4096, k=4096:
- Full fine-tuning: 4096² = 16M parameters
- LoRA: 2 × 4096 × 8 = 65,536 parameters (~250× reduction)
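The parameter counts are easy to verify (the exact ratio is 256×; the post rounds it to 250×):

```python
# Trainable parameters: full fine-tuning vs. a rank-8 LoRA adapter.
d, k, r = 4096, 4096, 8

full = d * k              # every entry of the weight matrix
lora = d * r + r * k      # B (d x r) plus A (r x k)
reduction = full / lora   # how many times fewer trainable parameters
```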
# LoRA in federated learning
client = octomil.OctomilClient(
project_id="federated-llm-lora",
adaptation="lora",
lora_rank=8,
target_modules=["q_proj", "v_proj"] # Only adapt attention
)
client.train(
model=frozen_llm, # Base LLM is frozen
rounds=20
)
# Each device:
# - Downloads frozen base model once (cached)
# - Only trains/uploads tiny LoRA weights
# - 250× communication reduction
RAC-LoRA: Theoretical Framework
RAC-LoRA (Randomized Asymmetric Chain of LoRA) [6] provides a rigorous theoretical framework for LoRA:
Key contributions:
- Convergence guarantees for LoRA in non-convex settings
- Extension to federated learning (Fed-RAC-LoRA)
- Adaptive rank selection
# Theoretically-grounded LoRA
client = octomil.OctomilClient(
project_id="rac-lora-fl",
adaptation="rac-lora",
adaptive_rank=True, # Automatically tune rank
asymmetric=True # RAC-LoRA's asymmetric decomposition
)
Bernoulli-LoRA: Randomized Low-Rank
Bernoulli-LoRA [7] introduces randomized low-rank adaptation:
Idea: Randomly activate subsets of LoRA adapters per training step.
Benefits:
- Further memory reduction
- Better generalization (regularization effect)
- Flexible trade-off between efficiency and accuracy
# Randomized LoRA
client = octomil.OctomilClient(
adaptation="bernoulli-lora",
lora_rank=16,
bernoulli_p=0.5 # 50% of adapters active per step
)
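One way to picture randomized adapter activation is dropout-style gating of the low-rank updates. This toy sketch assumes a per-step Bernoulli mask with inverse-probability scaling so the update is unbiased in expectation; it is an illustration of the general idea, not Bernoulli-LoRA's exact scheme:

```python
import numpy as np

def lora_forward(x, W, adapters, p=0.5, rng=None):
    """Frozen base weights W plus randomly gated LoRA adapters.

    Each adapter (B, A) is kept this step with probability p and its
    contribution scaled by 1/p (dropout-style unbiasedness)."""
    rng = rng or np.random.default_rng()
    out = x @ W.T
    for B, A in adapters:
        if rng.random() < p:                 # Bernoulli gate per adapter
            out += (x @ A.T) @ B.T / p
    return out

rng = np.random.default_rng(2)
d, k, r = 32, 32, 4
W = rng.standard_normal((d, k))              # frozen base weights
adapters = [(rng.standard_normal((d, r)), rng.standard_normal((r, k)))
            for _ in range(4)]
x = rng.standard_normal((8, k))
y = lora_forward(x, W, adapters, p=0.5, rng=rng)
```

With p=1 every adapter fires and the layer reduces to ordinary multi-adapter LoRA; lowering p trades accuracy per step for memory and compute.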
Federated LoRA with Sparse Communication
Smith's work on federated LoRA with sparse communication [8] addresses a remaining bottleneck:
Problem: Even LoRA weights can be large for very wide models.
Solution: Apply sparsification to LoRA updates.
# Sparse LoRA updates
client = octomil.OctomilClient(
adaptation="lora",
lora_rank=16,
sparsity=0.1, # Send top 10% of LoRA gradients
error_feedback=True # Maintain convergence
)
# Combined compression:
# - LoRA: 250× parameter reduction
# - Sparsity: 10× communication reduction
# - Total: 2500× savings
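Top-k sparsification with error feedback is simple to state in code. A generic sketch (the `topk_with_feedback` helper is illustrative, not Octomil's API): keep only the largest-magnitude entries, and carry everything else forward in a local error buffer so nothing is permanently lost.

```python
import numpy as np

def topk_with_feedback(grad, error, k_frac=0.1):
    """Send only the largest-magnitude k% of entries; accumulate the
    rest in an error buffer that is added back next round."""
    corrected = grad + error                  # fold in last round's residual
    k = max(1, int(corrected.size * k_frac))
    flat = np.abs(corrected).ravel()
    kth = flat.size - k
    thresh = np.partition(flat, kth)[kth]     # k-th largest magnitude
    mask = np.abs(corrected) >= thresh
    sent = np.where(mask, corrected, 0.0)     # what goes over the wire
    new_error = corrected - sent              # what we failed to send
    return sent, new_error

rng = np.random.default_rng(3)
g = rng.standard_normal((100, 100))
sent, err = topk_with_feedback(g, np.zeros_like(g), k_frac=0.1)
```

The error buffer is what keeps convergence intact: over many rounds every coordinate eventually gets transmitted, just later.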
Federated Pruning
FedP3: Personalized Privacy-Friendly Pruning
FedP3 [9] combines:
- Personalization: Each device learns custom pruned model
- Privacy: Only pruning masks shared, not weights
- Heterogeneity: Different sparsity per device capability
# Personalized federated pruning
client = octomil.OctomilClient(
project_id="federated-pruning",
pruning=True,
personalized_sparsity=True, # Each device custom sparsity
device_constraints={
"high_end": 0.3, # 30% sparsity (more capacity)
"mid_range": 0.6, # 60% sparsity
"low_end": 0.9 # 90% sparsity (constrained devices)
},
privacy="mask-only" # Share masks, not weights
)
# Result: Each device gets optimally compressed model
Prune at Clients, Not Server
Sparse-ProxSkip [10] (Richtárik et al.):
Key insight: Prune locally at clients before communication, not globally at server.
Benefits:
- Communication reduction (sparse updates)
- Privacy preservation (server never sees dense updates)
- Personalization (client-specific sparsity)
# Client-side pruning
client = octomil.OctomilClient(
pruning="client-side",
target_sparsity=0.8, # 80% sparsity in communication
aggregation="sparse-proxskip"
)
Knowledge Distillation for Efficiency
Progressive Knowledge Distillation
Progressive KD [11] (Smith et al., NeurIPS 2023):
Approach: Build ensemble of small models progressively, each distilled from previous.
Result: Better accuracy than single large model, faster inference.
# Progressive distillation in FL
client = octomil.OctomilClient(
project_id="distilled-ensemble",
distillation="progressive",
teacher_model=large_model, # Large global model
student_size="small", # Deploy small models to devices
ensemble_size=3 # Build 3-model ensemble
)
# Each device:
# - Trains small student model via distillation
# - Server aggregates student models
# - Students form ensemble (ensemble > single large model)
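The distillation objective itself is standard: match the student's softened output distribution to the teacher's. A minimal NumPy sketch of the Hinton-style loss term (illustrative; the progressive variant layers this across an ensemble):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between temperature-softened teacher and student
    distributions; higher T exposes the teacher's 'dark knowledge'."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean()

rng = np.random.default_rng(4)
teacher = rng.standard_normal((16, 10))          # teacher logits
loss_self = distill_loss(teacher, teacher)       # student matches teacher
loss_rand = distill_loss(rng.standard_normal((16, 10)), teacher)
```

A perfectly matching student hits the floor of this loss (the teacher's entropy); any mismatch raises it, which is what makes it a useful training signal.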
LLM-Specific Optimizations
GRASS: Structured Sparse Gradients
GRASS [12] (Smith et al., EMNLP 2024) enables memory-efficient LLM training:
Problem: LLM fine-tuning requires storing gradients for all parameters (expensive).
Solution: Maintain structured sparse gradients that fit in memory.
# Memory-efficient LLM training
client = octomil.OctomilClient(
model_type="llm",
gradient_sparsity=True,
sparsity_structure="grass", # Structured sparsity
target_memory="4gb" # Fit in 4 GB device memory
)
client.train(
model=llama_3_8b,
fit_in_memory=True # Octomil ensures memory constraints
)
MicroAdam: Low-Memory Adaptive Optimization
MicroAdam [13] (Richtárik et al., NeurIPS 2024):
Problem: The Adam optimizer stores first- and second-moment buffers for every parameter, i.e. two extra full copies of the model (3× total memory before activations).
Solution: Compressed moment storage with provable convergence.
# Memory-efficient Adam for FL
client = octomil.OctomilClient(
optimizer="microadam", # Compressed Adam
memory_budget="2gb" # Fit optimizer state in 2 GB
)
# MicroAdam:
# - 3× memory reduction vs. standard Adam
# - Provable convergence guarantees
# - Same accuracy as full Adam
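To see why optimizer state is the bottleneck, count the buffers (assuming fp32 throughout; this shows the problem MicroAdam compresses away, not its exact savings):

```python
# Optimizer-state memory for a 7B model: Adam keeps two extra
# full-precision buffers (first and second moments) per parameter.
params = 7_000_000_000
bytes_per = 4                                # fp32

weights_gb = params * bytes_per / 1e9        # 28 GB of weights
adam_state_gb = 2 * weights_gb               # m and v buffers: 56 GB
total_gb = weights_gb + adam_state_gb        # 84 GB before activations
```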
Octomil's Compression Framework
Unified API for model compression:
import octomil
# Define device tiers
device_tiers = {
"high_end": {
"memory": "8gb",
"quantization": 8, # 8-bit
"sparsity": 0.3,
"lora_rank": 32
},
"mid_range": {
"memory": "4gb",
"quantization": 4, # 4-bit
"sparsity": 0.6,
"lora_rank": 16
},
"low_end": {
"memory": "2gb",
"quantization": 4,
"sparsity": 0.8,
"lora_rank": 8
}
}
# Initialize with automatic compression
client = octomil.OctomilClient(
project_id="compressed-fl",
# Compression config
compression="adaptive", # Auto-select based on device
device_tiers=device_tiers,
# Techniques
quantization="pv-tuning",
pruning="sparse-proxskip",
adaptation="rac-lora",
# Optimization
optimizer="microadam"
)
# Train with automatic compression
client.train(
model=my_llm,
rounds=50
)
# Octomil automatically:
# - Detects device capabilities
# - Applies optimal compression per device
# - Aggregates heterogeneous updates
# - Maintains model quality
Real-World Impact
Production compression results:
| Model | Original Size | Compressed Size | Accuracy Loss | Technique |
|---|---|---|---|---|
| Llama-3-8B | 32 GB | 4 GB | 0.8% | 4-bit PV-Tuning |
| Mobile BERT | 440 MB | 55 MB | 0.5% | 8× pruning + quantization |
| Keyboard LM | 1.2 GB | 150 MB | 0.2% | LoRA (rank-8) |
| Vision Transformer | 2.1 GB | 260 MB | 1.2% | Progressive distillation |
Key insight: Combining techniques (quantization + pruning + LoRA) yields multiplicative compression.
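The multiplication is literal, as long as the techniques act on different axes (bits per weight, number of weights, what gets communicated):

```python
# Compression ratios compose multiplicatively across independent axes.
quant = 8          # 32-bit -> 4-bit: fewer bits per weight
prune = 2          # 50% structured sparsity: fewer weights
combined_model = quant * prune            # 16x smaller on-device model

lora = 250         # trainable-parameter reduction (rank-8 adapters)
sparsify = 10      # top-10% gradient sparsification
combined_comm = lora * sparsify           # 2500x less upload traffic
```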
When to Use Each Technique
| Technique | Best For | Compression | Accuracy Impact |
|---|---|---|---|
| Quantization | Inference-heavy | 4-8× | Low (<1%) |
| Pruning | Compute-constrained | 2-10× | Medium (1-5%) |
| LoRA | Fine-tuning large models | 100-1000× | Low (<2%) |
| Distillation | Deployment to weak devices | 5-20× | Medium (2-7%) |
| Combined | Production FL | 50-1000× | Low-Medium (2-5%) |
Getting Started
pip install octomil
# Initialize with compression
octomil init compressed-project \
--quantization 4bit \
--pruning sparse-proxskip \
--lora rank=16
# Train with automatic compression
octomil train \
--model llama-3-8b \
--compression adaptive \
--target-size 4gb
See our Advanced FL Configuration guide for detailed tutorials.