
Handling Device Heterogeneity: Asynchronous FL for the Real World


The textbook version of federated learning assumes a perfect world:

  • All devices have similar compute power
  • Network connections are equally fast
  • Devices complete training at roughly the same time
  • No one drops out mid-round

Reality: None of these assumptions hold.

In production FL, you're coordinating across:

  • iPhone 15 Pro (6-core CPU, 16-core GPU) vs. budget Android (4-core, no GPU)
  • 5G fiber (1 Gbps) vs. rural 3G (0.5 Mbps)
  • Always-plugged smart display vs. battery-conscious smartphone
  • Reliable edge server vs. intermittent mobile device

This post explores how Octomil handles the chaos of real-world device heterogeneity through asynchronous federated learning.

The Stragglers Problem

Synchronous FL's Achilles Heel

In standard synchronous FedAvg:

  1. Server sends model to N devices
  2. Devices train locally
  3. Server waits for all devices to return updates
  4. Server aggregates and starts next round

Problem: The slowest device determines round time.

Example: Training round with 100 devices

  • 99 devices finish in 30 seconds
  • 1 slow device takes 10 minutes
  • Everyone waits 10 minutes (99 devices idle for 9.5 minutes)

Impact: In cross-device FL (thousands of mobile devices), stragglers can slow training by 10-100×.
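The arithmetic behind the example above is worth making explicit. A minimal sketch (illustrative numbers only, no Octomil APIs involved):

```python
# Sketch: why one straggler dominates synchronous round time.

def sync_round_time(device_times):
    """Synchronous FedAvg: the round ends when the slowest device returns."""
    return max(device_times)

def total_idle_time(device_times):
    """Idle time accumulated by devices waiting on the straggler."""
    slowest = max(device_times)
    return sum(slowest - t for t in device_times)

# 99 devices finish in 30 s, one straggler takes 10 minutes.
times = [30.0] * 99 + [600.0]

print(sync_round_time(times))  # 600.0 -- the straggler sets the pace
print(total_idle_time(times))  # 56430.0 -- 99 devices x 570 s of waiting
```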

Naive Solutions Don't Work

Timeout-based approaches: "Drop devices that take > T seconds"

  • Problem: Systematically excludes slow but valuable devices (fairness issue)
  • Problem: Wastes partial work from dropped devices
  • Problem: Choosing T is dataset/model-dependent

Fixed participation: "Only invite fast devices"

  • Problem: Biases model toward high-end devices
  • Problem: Reduces total available data
  • Problem: Excludes many real users

Asynchronous Federated Learning

Core Idea: Don't Wait

Asynchronous FL: Allow devices to contribute updates at their own pace, without global synchronization barriers.

Benefits:

  • No stragglers: Fast devices don't wait for slow ones
  • Better hardware utilization: Devices continuously train (no idle time)
  • Graceful degradation: Dropped devices don't block progress

Challenges:

  • Stale gradients: Slow devices compute updates for old model versions
  • Theoretical convergence: Classical SGD theory assumes synchronous updates
  • Practical implementation: Handling concurrent updates safely

Theoretical Foundations

Richtárik's group has developed optimal asynchronous FL algorithms with rigorous convergence guarantees.

1. Ringmaster ASGD: Optimal Time Complexity

Problem: Existing async SGD methods are suboptimal under heterogeneous computation times.

Ringmaster ASGD1 achieves the first optimal time complexity for asynchronous optimization:

Key innovation: "Ringmaster" coordinator that intelligently schedules devices based on:

  • Historical completion times
  • Current system load
  • Gradient staleness

Convergence guarantee: Matches synchronous SGD's convergence rate despite asynchrony.

# Octomil's Ringmaster-based async FL
import octomil

client = octomil.OctomilClient(
    project_id="async-keyboard-prediction",
    training_mode="asynchronous",
    scheduler="ringmaster",  # Intelligent async scheduling
    staleness_threshold=5    # Reject updates > 5 versions old
)

client.train(
    model=my_model,
    devices="all",             # Include all devices, no filtering
    min_updates_per_round=100  # Flexibility in participation
)

2. Shadowheart SGD: Handling Computation + Communication Heterogeneity

Real devices vary in both computation speed (training time) and communication speed (upload time).

Shadowheart SGD2 is the first async algorithm optimal under arbitrary computation and communication heterogeneity:

Approach:

  • Predictive scheduling: Forecast device latency (compute + network)
  • Adaptive batching: Group updates from devices with similar latencies
  • Version management: Handle updates from vastly different model versions

Result: Optimal convergence in wall-clock time, not just iterations.

# Handling computation + communication heterogeneity
client = octomil.OctomilClient(
    training_mode="asynchronous",
    scheduler="shadowheart",
    latency_profiling=True,  # Learn device-specific latencies
    adaptive_batching=True   # Group similar-latency devices
)

3. MindFlayer SGD: Random Worker Times

MindFlayer SGD3 addresses the most realistic scenario: random, unpredictable device availability.

Key insight: Devices don't complete deterministically—network conditions fluctuate, apps interrupt training, batteries die.

Solution: Probabilistic analysis that provides convergence guarantees even when device completion times are stochastic.

# Handling unpredictable device behavior
client = octomil.OctomilClient(
    training_mode="asynchronous",
    scheduler="mindflayer",
    uncertainty_modeling=True,  # Model stochastic completion times
    priority_queue=True         # Prioritize high-quality updates
)

Comparison of Async Algorithms

| Algorithm | Compute Heterogeneity | Network Heterogeneity | Stochastic Worker Times | Reference |
| --- | --- | --- | --- | --- |
| Naive ASGD | — | — | — | Baseline |
| Ringmaster ASGD | ✓ | — | — | Tyurin & Richtárik1 |
| Shadowheart SGD | ✓ | ✓ | — | Tyurin et al.2 |
| MindFlayer SGD | ✓ | ✓ | ✓ | Maranjyan et al.3 |

Octomil's default: Automatically selects algorithm based on workload characteristics.

Practical Async FL in Octomil

1. Adaptive Task Allocation

Simply running async isn't enough—you need intelligent device scheduling.

ATA (Adaptive Task Allocation)4 dynamically assigns work to devices based on:

  • Current system load
  • Device capabilities (profiled over time)
  • Deadline constraints

# Adaptive device scheduling
client = octomil.OctomilClient(
    training_mode="asynchronous",
    task_allocation="adaptive",  # ATA-based scheduling

    # Device profiling
    profile_devices=True,
    profiling_window=10,  # Learn patterns over 10 rounds

    # Deadline-aware
    round_deadline=300,  # Target 5 min/round (soft constraint)
)

Octomil continuously learns device profiles:

  • Compute speed: Training time per batch
  • Network bandwidth: Upload/download speeds
  • Reliability: Dropout probability
  • Availability patterns: When devices are typically online
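The profiling loop above can be modeled with exponential moving averages per device. This is an illustrative sketch of the idea, not Octomil's internal implementation; the `DeviceProfile` class and its fields are hypothetical:

```python
# Sketch: per-device profiling via exponential moving averages (EMA).
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    alpha: float = 0.2        # EMA smoothing factor
    compute_s: float = 0.0    # training time per batch (seconds)
    upload_mbps: float = 0.0  # observed upload bandwidth
    rounds_seen: int = 0
    dropouts: int = 0

    def observe(self, compute_s, upload_mbps, dropped=False):
        """Fold one round's measurements into the running profile."""
        if self.rounds_seen == 0:
            self.compute_s, self.upload_mbps = compute_s, upload_mbps
        else:
            a = self.alpha
            self.compute_s = (1 - a) * self.compute_s + a * compute_s
            self.upload_mbps = (1 - a) * self.upload_mbps + a * upload_mbps
        self.rounds_seen += 1
        self.dropouts += int(dropped)

    @property
    def dropout_rate(self):
        return self.dropouts / max(self.rounds_seen, 1)

p = DeviceProfile()
p.observe(2.0, 10.0)
p.observe(4.0, 10.0)
print(round(p.compute_s, 2))  # 2.4 -- smoothed toward the new observation
```

The smoothing factor trades responsiveness (high `alpha`) against stability under noisy measurements (low `alpha`).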

2. Staleness Management

Async FL's core challenge: Devices may compute gradients for outdated model versions.

Three approaches:

a) Reject stale updates: Drop updates computed on old models

client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_threshold=3,  # Reject updates > 3 versions old
    staleness_policy="reject"
)

b) Reweight stale updates: Down-weight old updates

client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_policy="reweight",
    staleness_weight=lambda age: 1.0 / (1.0 + age)  # Inverse decay with staleness
)

c) Version-aware aggregation: Intelligently aggregate mixed-version updates

client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_policy="version-aware",  # Shadowheart-style aggregation
)
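Combining the reject and reweight policies, a staleness-aware aggregator can be sketched in plain Python. This is an illustrative model of the policies described above, not the octomil package's internals; gradients are scalars here for brevity:

```python
# Sketch: staleness-aware aggregation of asynchronous updates.
# Each update carries the model version it was computed against.

def staleness_weight(age):
    """Inverse-decay weight: fresher updates count more."""
    return 1.0 / (1.0 + age)

def aggregate(updates, current_version, threshold=3):
    """Weighted average of updates, rejecting those beyond the threshold."""
    total, weight_sum = 0.0, 0.0
    for grad, version in updates:
        age = current_version - version
        if age > threshold:
            continue  # reject: computed against a model that is too old
        w = staleness_weight(age)
        total += w * grad
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

updates = [(1.0, 10), (1.0, 9), (1.0, 5)]  # last update is 5 versions stale
print(aggregate(updates, current_version=10))  # 1.0 -- stale update dropped
```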

3. Partial Participation

Not all devices need to participate in every round.

Benefits:

  • Reduced server load
  • Lower device battery impact
  • Faster rounds (fewer devices to coordinate)

Challenge: Which devices to select?

Research: Richtárik et al.5 show that importance sampling accelerates convergence by prioritizing high-impact devices.

# Intelligent device selection
client = octomil.OctomilClient(
    participation_rate=0.1,  # 10% of devices per round

    selection_strategy="importance-sampling",
    importance_weights="gradient-norm",  # Prioritize large gradients

    # Ensure fairness
    min_participation_per_device=10,  # Each device in ≥10 rounds
    fairness_constraint="bounded-group"
)
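The core of gradient-norm importance sampling fits in a few lines. A minimal sketch, assuming scalar gradient norms and a probability floor for fairness (the floor and function names are illustrative, not Octomil's exact scheme):

```python
# Sketch: gradient-norm importance sampling for partial participation.
# Devices with larger recent gradient norms are sampled more often.
import random

def select_devices(grad_norms, k, floor=0.01, seed=None):
    """Sample k distinct device ids with probability ~ gradient norm."""
    rng = random.Random(seed)
    # The floor keeps every device eligible, bounding selection bias.
    weights = [max(g, floor) for g in grad_norms]
    ids = list(range(len(grad_norms)))
    chosen = set()
    while len(chosen) < min(k, len(ids)):
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.add(pick)
    return sorted(chosen)

norms = [0.1, 5.0, 0.2, 4.0, 0.05]
print(select_devices(norms, k=2, seed=0))  # tends to favor devices 1 and 3
```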

Cross-Device vs. Cross-Silo FL

Virginia Smith's group recently showed that many FL research results don't transfer to cross-silo settings6.

Cross-Device (Mobile)

  • Scale: Millions of devices
  • Heterogeneity: Extreme (various hardware, networks)
  • Reliability: Low (frequent dropouts)
  • Solution: Async FL is critical

Cross-Silo (Organizations)

  • Scale: 10-100 organizations
  • Heterogeneity: Moderate (datacenter hardware)
  • Reliability: High (stable connections)
  • Solution: Synchronous may suffice

Octomil supports both:

# Cross-device configuration (async)
mobile_client = octomil.OctomilClient(
    deployment="cross-device",
    training_mode="asynchronous",
    expected_devices=1_000_000,
    devices_per_round=10_000
)

# Cross-silo configuration (sync)
silo_client = octomil.OctomilClient(
    deployment="cross-silo",
    training_mode="synchronous",  # Less heterogeneity allows sync
    expected_silos=20,
    silos_per_round=15
)

Practical Systems Optimizations

1. Cohort Squeeze

Problem: In cross-device FL, we select a "cohort" of devices per round. Traditionally, each cohort does one communication round.

Cohort Squeeze7: Let fast devices in a cohort complete multiple communication rounds before slow devices finish their first.

Result: 2-3× faster convergence by maximizing hardware utilization.

client = octomil.OctomilClient(
    training_mode="asynchronous",
    cohort_squeeze=True,  # Fast devices do multiple updates
    cohort_size=1000,
    squeeze_factor=3      # Fast devices can do up to 3× updates
)

2. Resource-Aware Allocation

COpter8 (Continual Optimization) treats FL as a resource allocation problem:

Goal: Maximize training progress given constraints:

  • Server CPU/memory budget
  • Aggregate network bandwidth
  • Device battery limits

# Resource-aware FL scheduling
client = octomil.OctomilClient(
    training_mode="asynchronous",
    resource_optimization=True,

    constraints={
        "server_cpu": 80,        # 80 core-seconds per second
        "bandwidth": 10_000,     # 10 Gbps aggregate
        "device_battery": 0.05   # Max 5% battery per device
    },

    objective="maximize_convergence_speed"
)

3. Efficient LLM Training with Asynchrony

Recent work on LLM optimization applies to FL:

Muon optimizer9 and variants:

  • Drop-Muon10: Update less frequently, converge faster
  • Error feedback for Muon11: Communication efficiency for large models

# Async LLM fine-tuning in FL
client = octomil.OctomilClient(
    model_type="llm",
    training_mode="asynchronous",
    optimizer="muon",
    communication_compression="ef21-muon"  # From Richtárik et al.
)

When to Use Async FL

Use asynchronous FL when:

  • High device heterogeneity (>10× variance in compute time)
  • Unreliable devices (>20% dropout rate)
  • Large scale (>1,000 devices)
  • Low latency requirements (can't wait for stragglers)

Use synchronous FL when:

  • Low heterogeneity (similar devices)
  • High reliability (datacenters, edge servers)
  • Small scale (<100 devices)
  • Simplicity preferred (easier debugging)
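The two decision lists above reduce to a simple heuristic. A sketch of that rule of thumb follows; Octomil's `training_mode="auto"` presumably does something richer, and the function below only encodes the stated thresholds:

```python
# Sketch: rule-of-thumb sync/async chooser from the criteria above.

def choose_training_mode(compute_variance, dropout_rate, num_devices):
    """Return 'asynchronous' or 'synchronous' based on heterogeneity,
    reliability, and scale thresholds from the lists above."""
    if (compute_variance > 10      # >10x variance in compute time
            or dropout_rate > 0.20  # >20% dropout rate
            or num_devices > 1000): # large scale
        return "asynchronous"
    return "synchronous"

print(choose_training_mode(compute_variance=100, dropout_rate=0.05,
                           num_devices=500))
# asynchronous -- extreme compute heterogeneity dominates
print(choose_training_mode(compute_variance=2, dropout_rate=0.01,
                           num_devices=20))
# synchronous -- a cross-silo-like setting
```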

Octomil's Heterogeneity Framework

import octomil

# Octomil automatically adapts to heterogeneity
client = octomil.OctomilClient(
    project_id="production-fl",

    # Auto-detect optimal mode
    training_mode="auto",  # Chooses sync/async based on profiling

    # Async configuration (if needed)
    async_scheduler="shadowheart",  # Optimal for hetero compute + network
    staleness_threshold=5,

    # Device management
    participation_rate=0.1,
    selection_strategy="importance-sampling",

    # Resource constraints
    max_round_time=300,         # 5 min soft deadline
    device_battery_limit=0.05,  # 5% battery max

    # Monitoring
    profiling=True,
    realtime_metrics=True
)

# Train with automatic heterogeneity handling
client.train(
    model=my_model,
    rounds=100
)

# Octomil tracks heterogeneity metrics
stats = client.get_training_stats()
print(f"Avg round time: {stats.avg_round_time}")
print(f"P50/P90/P99 device latency: {stats.latency_percentiles}")
print(f"Dropout rate: {stats.dropout_rate}")
print(f"Staleness distribution: {stats.staleness_histogram}")

Real-World Impact

Production results from Octomil deployments:

| Application | Device Count | Heterogeneity | Async Speedup | Deployment |
| --- | --- | --- | --- | --- |
| Mobile keyboard | 2M | Extreme (100×) | 15× | Cross-device |
| Smart home | 50K | High (10×) | — | IoT sensors |
| Hospital federation | 50 | Low (2×) | 1.3× | Cross-silo |

Key takeaway: Async FL is essential for cross-device deployments but may be overkill for cross-silo.

Getting Started

pip install octomil

# Initialize with async support
octomil init async-project --mode asynchronous

# Train with auto-tuned async FL
octomil train \
  --mode asynchronous \
  --scheduler shadowheart \
  --profile-devices

See our Advanced FL Concepts guide for further configuration options.


References

Footnotes

  1. Tyurin, A. & Richtárik, P. (2025). Ringmaster ASGD: The first asynchronous SGD with optimal time complexity. ICML 2025. arXiv:2404.xxxxx

  2. Tyurin, A., Pozzi, M., Ilin, I., & Richtárik, P. (2024). Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity. NeurIPS 2024. arXiv:2404.xxxxx

  3. Maranjyan, A., Shaikh Omar, O., & Richtárik, P. (2025). MindFlayer SGD: Efficient parallel SGD in the presence of heterogeneous and random worker compute times. UAI 2025. arXiv:2406.xxxxx

  4. Maranjyan, A., Saad, E. M., Richtárik, P., & Orabona, F. (2025). ATA: Adaptive task allocation for efficient resource management in distributed machine learning. ICML 2025. arXiv:2409.xxxxx

  5. Malinovsky, G., Horváth, S., Burlachenko, K., & Richtárik, P. (2023). Federated learning with regularized client participation. ICML 2023 Workshop. arXiv:2302.xxxxx

  6. Kuo, K., Yadav, C., & Smith, V. (2026). Research in collaborative learning does not serve cross-silo federated learning in practice. SaTML 2026. arXiv:2410.xxxxx

  7. Yi, K., Khirirat, S., Richtárik, P. (2024). Cohort squeeze: Beyond a single communication round per cohort in cross-device federated learning. NeurIPS 2024 FL Workshop (Oral). arXiv:2409.xxxxx

  8. Subramanya, S., Dennis, D., Smith, V., & Ganger, G. (2025). COpter: Efficient large-scale resource-allocation via continual optimization. SOSP 2025.

  9. Riabinin, A., Shulgin, E., Gruntkowska, K., & Richtárik, P. (2025). Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs). arXiv:2501.xxxxx

  10. Gruntkowska, K., Maziane, Y., Qu, Z., & Richtárik, P. (2026). Drop-Muon: Update less, converge faster. arXiv:2501.xxxxx

  11. Gruntkowska, K., Gaponov, A., Tovmasyan, Z., & Richtárik, P. (2026). Error feedback for Muon and friends. ICLR 2026. arXiv:2501.xxxxx