Handling Device Heterogeneity: Asynchronous FL for the Real World
The textbook version of federated learning assumes a perfect world:
- All devices have similar compute power
- Network connections are equally fast
- Devices complete training at roughly the same time
- No one drops out mid-round
Reality: None of these assumptions hold.
In production FL, you're coordinating across:
- iPhone 15 Pro (6-core CPU, 16-core GPU) vs. budget Android (4-core, no GPU)
- Urban 5G (1 Gbps) vs. rural 3G (0.5 Mbps)
- Always-plugged smart display vs. battery-conscious smartphone
- Reliable edge server vs. intermittent mobile device
This post explores how Octomil handles the chaos of real-world device heterogeneity through asynchronous federated learning.
The Stragglers Problem
Synchronous FL's Achilles Heel
In standard synchronous FedAvg:
- Server sends model to N devices
- Devices train locally
- Server waits for all devices to return updates
- Server aggregates and starts next round
Problem: The slowest device determines round time.
Example: Training round with 100 devices
- 99 devices finish in 30 seconds
- 1 slow device takes 10 minutes
- Everyone waits 10 minutes (99 devices idle for 9.5 minutes)
Impact: In cross-device FL (thousands of mobile devices), stragglers can slow training by 10-100×.
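The arithmetic behind this is easy to check. A short standalone simulation of the 100-device round above (illustrative only, not Octomil code):

```python
# Synchronous FedAvg round time is gated by the slowest device.
device_times = [30.0] * 99 + [600.0]  # 99 fast devices (30 s), 1 straggler (10 min)

sync_round_time = max(device_times)   # everyone waits for the straggler
mean_device_time = sum(device_times) / len(device_times)

print(f"Synchronous round time: {sync_round_time:.0f} s")
print(f"Mean device time:       {mean_device_time:.1f} s")
print(f"Idle time per fast device: {sync_round_time - 30.0:.0f} s")
# → round time 600 s; each fast device sits idle for 570 s (9.5 min)
```

Even though the average device needs only ~36 seconds, the round takes 10 minutes: the maximum, not the mean, determines wall-clock progress.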
Naive Solutions Don't Work
Timeout-based approaches: "Drop devices that take > T seconds"
- Problem: Systematically excludes slow but valuable devices (fairness issue)
- Problem: Wastes partial work from dropped devices
- Problem: Choosing T is dataset/model-dependent
Fixed participation: "Only invite fast devices"
- Problem: Biases model toward high-end devices
- Problem: Reduces total available data
- Problem: Excludes many real users
Asynchronous Federated Learning
Core Idea: Don't Wait
Asynchronous FL: Allow devices to contribute updates at their own pace, without global synchronization barriers.
Benefits:
- No stragglers: Fast devices don't wait for slow ones
- Better hardware utilization: Devices continuously train (no idle time)
- Graceful degradation: Dropped devices don't block progress
Challenges:
- Stale gradients: Slow devices compute updates for old model versions
- Theoretical convergence: Classical SGD theory assumes synchronous updates
- Practical implementation: Handling concurrent updates safely
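A minimal sketch makes the core idea concrete. The `AsyncServer` class and `staleness_weight` helper below are illustrative inventions, not Octomil's implementation: the server applies each update on arrival, down-weighted by how many model versions it lags, and never blocks on a barrier.

```python
# Minimal async aggregation sketch: apply updates as they arrive,
# down-weighting stale ones. Illustrative only.

def staleness_weight(age: int) -> float:
    """Weight an update by how many versions behind it is."""
    return 1.0 / (1.0 + age)

class AsyncServer:
    def __init__(self, model, lr=0.1):
        self.model = model
        self.lr = lr
        self.version = 0

    def apply_update(self, update, trained_on_version):
        age = self.version - trained_on_version   # gradient staleness
        w = staleness_weight(age)
        self.model = [m - self.lr * w * g for m, g in zip(self.model, update)]
        self.version += 1                         # no barrier: model advances per update

server = AsyncServer([1.0, 1.0])
server.apply_update([0.5, -0.5], trained_on_version=0)  # fresh update, age 0
server.apply_update([0.5, -0.5], trained_on_version=0)  # stale update, age 1
print(server.model, server.version)
```

Fast devices can keep submitting fresh updates while slow devices' stale updates still contribute, just with less influence.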
Theoretical Foundations
Richtárik's group has developed optimal asynchronous FL algorithms with rigorous convergence guarantees.
1. Ringmaster ASGD: Optimal Time Complexity
Problem: Existing async SGD methods are suboptimal under heterogeneous computation times.
Ringmaster ASGD [1] achieves the first optimal time complexity for asynchronous optimization.
Key innovation: "Ringmaster" coordinator that intelligently schedules devices based on:
- Historical completion times
- Current system load
- Gradient staleness
Convergence guarantee: Matches synchronous SGD's convergence rate despite asynchrony.
```python
# Octomil's Ringmaster-based async FL
import octomil

client = octomil.OctomilClient(
    project_id="async-keyboard-prediction",
    training_mode="asynchronous",
    scheduler="ringmaster",     # Intelligent async scheduling
    staleness_threshold=5       # Reject updates > 5 versions old
)

client.train(
    model=my_model,
    devices="all",              # Include all devices, no filtering
    min_updates_per_round=100   # Flexibility in participation
)
```
2. Shadowheart SGD: Handling Computation + Communication Heterogeneity
Real devices vary in both computation speed (training time) and communication speed (upload time).
Shadowheart SGD [2] is the first async algorithm optimal under arbitrary computation and communication heterogeneity.
Approach:
- Predictive scheduling: Forecast device latency (compute + network)
- Adaptive batching: Group updates from devices with similar latencies
- Version management: Handle updates from vastly different model versions
Result: Optimal convergence in wall-clock time, not just iterations.
```python
# Handling computation + communication heterogeneity
client = octomil.OctomilClient(
    training_mode="asynchronous",
    scheduler="shadowheart",
    latency_profiling=True,   # Learn device-specific latencies
    adaptive_batching=True    # Group similar-latency devices
)
```
3. MindFlayer SGD: Random Worker Times
MindFlayer SGD [3] addresses the most realistic scenario: random, unpredictable device availability.
Key insight: Devices don't complete deterministically—network conditions fluctuate, apps interrupt training, batteries die.
Solution: Probabilistic analysis that provides convergence guarantees even when device completion times are stochastic.
```python
# Handling unpredictable device behavior
client = octomil.OctomilClient(
    training_mode="asynchronous",
    scheduler="mindflayer",
    uncertainty_modeling=True,  # Model stochastic completion times
    priority_queue=True         # Prioritize high-quality updates
)
```
Comparison of Async Algorithms
| Algorithm | Handles Compute Heterogeneity | Handles Network Heterogeneity | Handles Stochasticity | Reference |
|---|---|---|---|---|
| Naive ASGD | ✗ | ✗ | ✗ | Baseline |
| Ringmaster ASGD | ✓ | ✗ | ✗ | Tyurin & Richtárik [1] |
| Shadowheart SGD | ✓ | ✓ | ✗ | Tyurin et al. [2] |
| MindFlayer SGD | ✓ | ✓ | ✓ | Maranjyan et al. [3] |
Octomil's default: Automatically selects algorithm based on workload characteristics.
Practical Async FL in Octomil
1. Adaptive Task Allocation
Simply running async isn't enough—you need intelligent device scheduling.
ATA (Adaptive Task Allocation) [4] dynamically assigns work to devices based on:
- Current system load
- Device capabilities (profiled over time)
- Deadline constraints
```python
# Adaptive device scheduling
client = octomil.OctomilClient(
    training_mode="asynchronous",
    task_allocation="adaptive",  # ATA-based scheduling
    # Device profiling
    profile_devices=True,
    profiling_window=10,         # Learn patterns over 10 rounds
    # Deadline-aware
    round_deadline=300,          # Target 5 min/round (soft constraint)
)
```
Octomil continuously learns device profiles:
- Compute speed: Training time per batch
- Network bandwidth: Upload/download speeds
- Reliability: Dropout probability
- Availability patterns: When devices are typically online
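One common way to maintain such profiles is an exponential moving average (EMA) per metric, so recent rounds dominate without discarding history. The `DeviceProfile` class below is a hypothetical sketch, not part of the Octomil SDK:

```python
# Sketch of per-device profiling with exponential moving averages (EMA).
class DeviceProfile:
    def __init__(self, alpha=0.2):
        self.alpha = alpha        # smoothing factor: higher = more reactive
        self.train_time = None    # seconds per local training round
        self.upload_mbps = None   # observed upload bandwidth
        self.dropout_rate = 0.0   # fraction of rounds the device dropped

    def _ema(self, old, new):
        return new if old is None else (1 - self.alpha) * old + self.alpha * new

    def observe_round(self, train_time, upload_mbps, dropped):
        self.train_time = self._ema(self.train_time, train_time)
        self.upload_mbps = self._ema(self.upload_mbps, upload_mbps)
        self.dropout_rate = self._ema(self.dropout_rate, 1.0 if dropped else 0.0)

profile = DeviceProfile()
profile.observe_round(train_time=30.0, upload_mbps=5.0, dropped=False)
profile.observe_round(train_time=50.0, upload_mbps=4.0, dropped=True)
print(profile.train_time, profile.upload_mbps, profile.dropout_rate)
```

A scheduler can then rank devices by predicted latency (`train_time` plus model size over `upload_mbps`) and discount unreliable ones by `dropout_rate`.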
2. Staleness Management
Async FL's core challenge: Devices may compute gradients for outdated model versions.
Three approaches:
a) Reject stale updates: Drop updates computed on old models
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_threshold=3,     # Reject updates > 3 versions old
    staleness_policy="reject"
)
```
b) Reweight stale updates: Down-weight old updates
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_policy="reweight",
    staleness_weight=lambda age: 1.0 / (1.0 + age)  # Inverse decay in staleness
)
```
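A note on the weight function: the lambda above is inverse (harmonic) decay, which penalizes staleness gently; a true exponential decay suppresses old updates much faster. A quick standalone comparison:

```python
import math

def inverse_decay(age):
    # Weight falls off as 1/(1 + age): slow, heavy-tailed discounting.
    return 1.0 / (1.0 + age)

def exponential_decay(age, rate=0.5):
    # Weight falls off as exp(-rate * age): aggressive discounting.
    return math.exp(-rate * age)

for age in (0, 2, 5, 10):
    print(f"age={age:2d}  inverse={inverse_decay(age):.3f}  "
          f"exponential={exponential_decay(age):.3f}")
# At age 10, inverse decay still keeps weight 1/11 ≈ 0.091,
# while exponential decay has fallen to exp(-5) ≈ 0.007.
```

Which schedule is appropriate depends on how quickly the global model drifts between versions: fast-moving models warrant the more aggressive schedule.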
c) Version-aware aggregation: Intelligently aggregate mixed-version updates
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_policy="version-aware",  # Shadowheart-style aggregation
)
```
3. Partial Participation
Not all devices need to participate in every round.
Benefits:
- Reduced server load
- Lower device battery impact
- Faster rounds (fewer devices to coordinate)
Challenge: Which devices to select?
Research: Richtárik et al. [5] show that importance sampling accelerates convergence by prioritizing high-impact devices.
```python
# Intelligent device selection
client = octomil.OctomilClient(
    participation_rate=0.1,                    # 10% of devices per round
    selection_strategy="importance-sampling",
    importance_weights="gradient-norm",        # Prioritize large gradients
    # Ensure fairness
    min_participation_per_device=10,           # Each device in ≥ 10 rounds
    fairness_constraint="bounded-group"
)
```
Cross-Device vs. Cross-Silo FL
Virginia Smith's group recently showed that many FL research results don't transfer to cross-silo settings [6].
Cross-Device (Mobile)
- Scale: Millions of devices
- Heterogeneity: Extreme (various hardware, networks)
- Reliability: Low (frequent dropouts)
- Solution: Async FL is critical
Cross-Silo (Organizations)
- Scale: 10-100 organizations
- Heterogeneity: Moderate (datacenter hardware)
- Reliability: High (stable connections)
- Solution: Synchronous may suffice
Octomil supports both:
```python
# Cross-device configuration (async)
mobile_client = octomil.OctomilClient(
    deployment="cross-device",
    training_mode="asynchronous",
    expected_devices=1_000_000,
    devices_per_round=10_000
)

# Cross-silo configuration (sync)
silo_client = octomil.OctomilClient(
    deployment="cross-silo",
    training_mode="synchronous",  # Less heterogeneity allows sync
    expected_silos=20,
    silos_per_round=15
)
```
Practical Systems Optimizations
1. Cohort Squeeze
Problem: In cross-device FL, we select a "cohort" of devices per round. Traditionally, each cohort does one communication round.
Cohort Squeeze [7]: Let fast devices in a cohort complete multiple communication rounds before slow devices finish their first.
Result: 2-3× faster convergence by maximizing hardware utilization.
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    cohort_squeeze=True,  # Fast devices do multiple updates
    cohort_size=1000,
    squeeze_factor=3      # Fast devices can do up to 3× updates
)
```
2. Resource-Aware Allocation
COpter [8] (Continual Optimization) treats FL as a resource allocation problem:
Goal: Maximize training progress given constraints:
- Server CPU/memory budget
- Aggregate network bandwidth
- Device battery limits
```python
# Resource-aware FL scheduling
client = octomil.OctomilClient(
    training_mode="asynchronous",
    resource_optimization=True,
    constraints={
        "server_cpu": 80,        # 80 core-seconds per second
        "bandwidth": 10_000,     # 10 Gbps aggregate
        "device_battery": 0.05   # Max 5% battery per device
    },
    objective="maximize_convergence_speed"
)
```
3. Efficient LLM Training with Asynchrony
Recent work on LLM optimization applies to FL:
Muon optimizer [9] and variants:
- Drop-Muon [10]: Update less frequently, converge faster
- Error feedback for Muon [11]: Communication efficiency for large models
```python
# Async LLM fine-tuning in FL
client = octomil.OctomilClient(
    model_type="llm",
    training_mode="asynchronous",
    optimizer="muon",
    communication_compression="ef21-muon"  # From Richtárik et al.
)
```
When to Use Async FL
Use asynchronous FL when:
- High device heterogeneity (>10× variance in compute time)
- Unreliable devices (>20% dropout rate)
- Large scale (>1,000 devices)
- Low latency requirements (can't wait for stragglers)
Use synchronous FL when:
- Low heterogeneity (similar devices)
- High reliability (datacenters, edge servers)
- Small scale (<100 devices)
- Simplicity preferred (easier debugging)
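These checklists can be folded into a simple rule of thumb. The function below is an illustrative sketch of such a heuristic, with thresholds taken from the lists above; it is not Octomil's actual `training_mode="auto"` logic:

```python
def choose_training_mode(compute_time_spread, dropout_rate, num_devices):
    """Pick sync vs. async from fleet statistics (illustrative thresholds)."""
    if (compute_time_spread > 10    # >10× spread in device compute times
            or dropout_rate > 0.2   # >20% of devices drop mid-round
            or num_devices > 1000): # large-scale cross-device fleet
        return "asynchronous"
    return "synchronous"

# Cross-device mobile fleet: extreme heterogeneity, flaky devices
print(choose_training_mode(100, 0.3, 2_000_000))  # → asynchronous

# Small cross-silo federation: uniform datacenter hardware
print(choose_training_mode(2, 0.01, 20))          # → synchronous
```

Any one of the three conditions is enough to justify async, since a single bottleneck (stragglers, dropouts, or scale) already breaks synchronous rounds.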
Octomil's Heterogeneity Framework
```python
import octomil

# Octomil automatically adapts to heterogeneity
client = octomil.OctomilClient(
    project_id="production-fl",
    # Auto-detect optimal mode
    training_mode="auto",           # Chooses sync/async based on profiling
    # Async configuration (if needed)
    async_scheduler="shadowheart",  # Optimal for hetero compute + network
    staleness_threshold=5,
    # Device management
    participation_rate=0.1,
    selection_strategy="importance-sampling",
    # Resource constraints
    max_round_time=300,             # 5 min soft deadline
    device_battery_limit=0.05,      # 5% battery max
    # Monitoring
    profiling=True,
    realtime_metrics=True
)

# Train with automatic heterogeneity handling
client.train(
    model=my_model,
    rounds=100
)

# Octomil tracks heterogeneity metrics
stats = client.get_training_stats()
print(f"Avg round time: {stats.avg_round_time}")
print(f"P50/P90/P99 device latency: {stats.latency_percentiles}")
print(f"Dropout rate: {stats.dropout_rate}")
print(f"Staleness distribution: {stats.staleness_histogram}")
```
Real-World Impact
Production results from Octomil deployments:
| Application | Device Count | Heterogeneity | Async Speedup | Setting |
|---|---|---|---|---|
| Mobile keyboard | 2M | Extreme (100×) | 15× | Cross-device |
| Smart home | 50K | High (10×) | 8× | IoT sensors |
| Hospital federation | 50 | Low (2×) | 1.3× | Cross-silo |
Key takeaway: Async FL is essential for cross-device deployments but may be overkill for cross-silo.
Getting Started
```shell
pip install octomil

# Initialize with async support
octomil init async-project --mode asynchronous

# Train with auto-tuned async FL
octomil train \
  --mode asynchronous \
  --scheduler shadowheart \
  --profile-devices
```
See our Advanced FL Concepts guide for more configuration options.