Handling Device Heterogeneity: Asynchronous FL for the Real World
The textbook version of federated learning assumes a perfect world:
- All devices have similar compute power
- Network connections are equally fast
- Devices complete training at roughly the same time
- No one drops out mid-round
Reality: None of these assumptions hold.
In production FL, you're coordinating across:
- iPhone 15 Pro (6-core CPU, 16-core GPU) vs. budget Android (4-core, no GPU)
- Urban 5G (1 Gbps) vs. rural 3G (0.5 Mbps)
- Always-plugged smart display vs. battery-conscious smartphone
- Reliable edge server vs. intermittent mobile device
This post explores how Octomil handles the chaos of real-world device heterogeneity through asynchronous federated learning.
The Stragglers Problem
Synchronous FL's Achilles Heel
In standard synchronous FedAvg:
- Server sends model to N devices
- Devices train locally
- Server waits for all devices to return updates
- Server aggregates and starts next round
Problem: The slowest device determines round time.
Example: Training round with 100 devices
- 99 devices finish in 30 seconds
- 1 slow device takes 10 minutes
- Everyone waits 10 minutes (99 devices idle for 9.5 minutes)
Impact: In cross-device FL (thousands of mobile devices), stragglers can slow training by 10-100×.
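The arithmetic behind this is easy to check. A short standalone simulation of the 100-device round above (illustrative only, not Octomil code):

```python
# Synchronous FedAvg round time is gated by the slowest device.
device_times = [30.0] * 99 + [600.0]  # 99 fast devices (30 s), 1 straggler (10 min)

sync_round_time = max(device_times)   # everyone waits for the straggler
mean_device_time = sum(device_times) / len(device_times)

print(f"Synchronous round time: {sync_round_time:.0f} s")
print(f"Mean device time:       {mean_device_time:.1f} s")
print(f"Idle time per fast device: {sync_round_time - 30.0:.0f} s")
# → round time 600 s; each fast device sits idle for 570 s (9.5 min)
```

Even though the average device needs only ~36 seconds, the round takes 10 minutes: the maximum, not the mean, determines wall-clock progress.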
Naive Solutions Don't Work
Timeout-based approaches: "Drop devices that take > T seconds"
- Problem: Systematically excludes slow but valuable devices (fairness issue)
- Problem: Wastes partial work from dropped devices
- Problem: Choosing T is dataset/model-dependent
Fixed participation: "Only invite fast devices"
- Problem: Biases model toward high-end devices
- Problem: Reduces total available data
- Problem: Excludes many real users
Asynchronous Federated Learning
Core Idea: Don't Wait
Asynchronous FL: Allow devices to contribute updates at their own pace, without global synchronization barriers.
Benefits:
- No stragglers: Fast devices don't wait for slow ones
- Better hardware utilization: Devices continuously train (no idle time)
- Graceful degradation: Dropped devices don't block progress
Challenges:
- Stale gradients: Slow devices compute updates for old model versions
- Theoretical convergence: Classical SGD theory assumes synchronous updates
- Practical implementation: Handling concurrent updates safely
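A minimal sketch makes the core idea concrete. The `AsyncServer` class and `staleness_weight` helper below are illustrative inventions, not Octomil's implementation: the server applies each update on arrival, down-weighted by how many model versions it lags, and never blocks on a barrier.

```python
# Minimal async aggregation sketch: apply updates as they arrive,
# down-weighting stale ones. Illustrative only.

def staleness_weight(age: int) -> float:
    """Weight an update by how many versions behind it is."""
    return 1.0 / (1.0 + age)

class AsyncServer:
    def __init__(self, model, lr=0.1):
        self.model = model
        self.lr = lr
        self.version = 0

    def apply_update(self, update, trained_on_version):
        age = self.version - trained_on_version   # gradient staleness
        w = staleness_weight(age)
        self.model = [m - self.lr * w * g for m, g in zip(self.model, update)]
        self.version += 1                         # no barrier: model advances per update

server = AsyncServer([1.0, 1.0])
server.apply_update([0.5, -0.5], trained_on_version=0)  # fresh update, age 0
server.apply_update([0.5, -0.5], trained_on_version=0)  # stale update, age 1
print(server.model, server.version)
```

Fast devices can keep submitting fresh updates while slow devices' stale updates still contribute, just with less influence.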
Theoretical Foundations
Richtárik's group has developed optimal asynchronous FL algorithms with rigorous convergence guarantees.
1. Ringmaster ASGD: Optimal Time Complexity
Problem: Existing async SGD methods are suboptimal under heterogeneous computation times.
Ringmaster ASGD [1] achieves the first optimal time complexity for asynchronous optimization.
Key innovation: "Ringmaster" coordinator that intelligently schedules devices based on:
- Historical completion times
- Current system load
- Gradient staleness
Convergence guarantee: Matches synchronous SGD's convergence rate despite asynchrony.
```python
# Octomil's Ringmaster-based async FL
import octomil

client = octomil.OctomilClient(
    project_id="async-keyboard-prediction",
    training_mode="asynchronous",
    scheduler="ringmaster",     # Intelligent async scheduling
    staleness_threshold=5       # Reject updates > 5 versions old
)

client.train(
    model=my_model,
    devices="all",              # Include all devices, no filtering
    min_updates_per_round=100   # Flexibility in participation
)
```
2. Shadowheart SGD: Handling Computation + Communication Heterogeneity
Real devices vary in both computation speed (training time) and communication speed (upload time).
Shadowheart SGD [2] is the first async algorithm optimal under arbitrary computation and communication heterogeneity.
Approach:
- Predictive scheduling: Forecast device latency (compute + network)
- Adaptive batching: Group updates from devices with similar latencies
- Version management: Handle updates from vastly different model versions
Result: Optimal convergence in wall-clock time, not just iterations.
```python
# Handling computation + communication heterogeneity
client = octomil.OctomilClient(
    training_mode="asynchronous",
    scheduler="shadowheart",
    latency_profiling=True,   # Learn device-specific latencies
    adaptive_batching=True    # Group similar-latency devices
)
```
3. MindFlayer SGD: Random Worker Times
MindFlayer SGD [3] addresses the most realistic scenario: random, unpredictable device availability.
Key insight: Devices don't complete deterministically—network conditions fluctuate, apps interrupt training, batteries die.
Solution: Probabilistic analysis that provides convergence guarantees even when device completion times are stochastic.
```python
# Handling unpredictable device behavior
client = octomil.OctomilClient(
    training_mode="asynchronous",
    scheduler="mindflayer",
    uncertainty_modeling=True,  # Model stochastic completion times
    priority_queue=True         # Prioritize high-quality updates
)
```
Comparison of Async Algorithms
| Algorithm | Handles Compute Heterogeneity | Handles Network Heterogeneity | Handles Stochasticity | Reference |
|---|---|---|---|---|
| Naive ASGD | ✗ | ✗ | ✗ | Baseline |
| Ringmaster ASGD | ✓ | ✗ | ✗ | Tyurin & Richtárik [1] |
| Shadowheart SGD | ✓ | ✓ | ✗ | Tyurin et al. [2] |
| MindFlayer SGD | ✓ | ✓ | ✓ | Maranjyan et al. [3] |
Octomil's default: Automatically selects algorithm based on workload characteristics.
Practical Async FL in Octomil
1. Adaptive Task Allocation
Simply running async isn't enough—you need intelligent device scheduling.
ATA (Adaptive Task Allocation) [4] dynamically assigns work to devices based on:
- Current system load
- Device capabilities (profiled over time)
- Deadline constraints
```python
# Adaptive device scheduling
client = octomil.OctomilClient(
    training_mode="asynchronous",
    task_allocation="adaptive",  # ATA-based scheduling
    # Device profiling
    profile_devices=True,
    profiling_window=10,         # Learn patterns over 10 rounds
    # Deadline-aware
    round_deadline=300,          # Target 5 min/round (soft constraint)
)
```
Octomil continuously learns device profiles:
- Compute speed: Training time per batch
- Network bandwidth: Upload/download speeds
- Reliability: Dropout probability
- Availability patterns: When devices are typically online
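One common way to maintain such profiles is an exponential moving average (EMA) per metric, so recent rounds dominate without discarding history. The `DeviceProfile` class below is a hypothetical sketch, not part of the Octomil SDK:

```python
# Sketch of per-device profiling with exponential moving averages (EMA).
class DeviceProfile:
    def __init__(self, alpha=0.2):
        self.alpha = alpha        # smoothing factor: higher = more reactive
        self.train_time = None    # seconds per local training round
        self.upload_mbps = None   # observed upload bandwidth
        self.dropout_rate = 0.0   # fraction of rounds the device dropped

    def _ema(self, old, new):
        return new if old is None else (1 - self.alpha) * old + self.alpha * new

    def observe_round(self, train_time, upload_mbps, dropped):
        self.train_time = self._ema(self.train_time, train_time)
        self.upload_mbps = self._ema(self.upload_mbps, upload_mbps)
        self.dropout_rate = self._ema(self.dropout_rate, 1.0 if dropped else 0.0)

profile = DeviceProfile()
profile.observe_round(train_time=30.0, upload_mbps=5.0, dropped=False)
profile.observe_round(train_time=50.0, upload_mbps=4.0, dropped=True)
print(profile.train_time, profile.upload_mbps, profile.dropout_rate)
```

A scheduler can then rank devices by predicted latency (`train_time` plus model size over `upload_mbps`) and discount unreliable ones by `dropout_rate`.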
2. Staleness Management
Async FL's core challenge: Devices may compute gradients for outdated model versions.
Three approaches:
a) Reject stale updates: Drop updates computed on old models
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_threshold=3,     # Reject updates > 3 versions old
    staleness_policy="reject"
)
```
b) Reweight stale updates: Down-weight old updates
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_policy="reweight",
    staleness_weight=lambda age: 1.0 / (1.0 + age)  # Inverse decay in staleness
)
```
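A note on the weight function: the lambda above is inverse (harmonic) decay, which penalizes staleness gently; a true exponential decay suppresses old updates much faster. A quick standalone comparison:

```python
import math

def inverse_decay(age):
    # Weight falls off as 1/(1 + age): slow, heavy-tailed discounting.
    return 1.0 / (1.0 + age)

def exponential_decay(age, rate=0.5):
    # Weight falls off as exp(-rate * age): aggressive discounting.
    return math.exp(-rate * age)

for age in (0, 2, 5, 10):
    print(f"age={age:2d}  inverse={inverse_decay(age):.3f}  "
          f"exponential={exponential_decay(age):.3f}")
# At age 10, inverse decay still keeps weight 1/11 ≈ 0.091,
# while exponential decay has fallen to exp(-5) ≈ 0.007.
```

Which schedule is appropriate depends on how quickly the global model drifts between versions: fast-moving models warrant the more aggressive schedule.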
c) Version-aware aggregation: Intelligently aggregate mixed-version updates
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    staleness_policy="version-aware",  # Shadowheart-style aggregation
)
```
3. Partial Participation
Not all devices need to participate in every round.
Benefits:
- Reduced server load
- Lower device battery impact
- Faster rounds (fewer devices to coordinate)
Challenge: Which devices to select?
Research: Richtárik et al. [5] show that importance sampling accelerates convergence by prioritizing high-impact devices.
```python
# Intelligent device selection
client = octomil.OctomilClient(
    participation_rate=0.1,                    # 10% of devices per round
    selection_strategy="importance-sampling",
    importance_weights="gradient-norm",        # Prioritize large gradients
    # Ensure fairness
    min_participation_per_device=10,           # Each device in ≥ 10 rounds
    fairness_constraint="bounded-group"
)
```
Cross-Device vs. Cross-Silo FL
Virginia Smith's group recently showed that many FL research results don't transfer to cross-silo settings [6].
Cross-Device (Mobile)
- Scale: Millions of devices
- Heterogeneity: Extreme (various hardware, networks)
- Reliability: Low (frequent dropouts)
- Solution: Async FL is critical
Cross-Silo (Organizations)
- Scale: 10-100 organizations
- Heterogeneity: Moderate (datacenter hardware)
- Reliability: High (stable connections)
- Solution: Synchronous may suffice
Octomil supports both:
```python
# Cross-device configuration (async)
mobile_client = octomil.OctomilClient(
    deployment="cross-device",
    training_mode="asynchronous",
    expected_devices=1_000_000,
    devices_per_round=10_000
)

# Cross-silo configuration (sync)
silo_client = octomil.OctomilClient(
    deployment="cross-silo",
    training_mode="synchronous",  # Less heterogeneity allows sync
    expected_silos=20,
    silos_per_round=15
)
```
Practical Systems Optimizations
1. Cohort Squeeze
Problem: In cross-device FL, we select a "cohort" of devices per round. Traditionally, each cohort does one communication round.
Cohort Squeeze [7]: Let fast devices in a cohort complete multiple communication rounds before slow devices finish their first.
Result: 2-3× faster convergence by maximizing hardware utilization.
```python
client = octomil.OctomilClient(
    training_mode="asynchronous",
    cohort_squeeze=True,  # Fast devices do multiple updates
    cohort_size=1000,
    squeeze_factor=3      # Fast devices can do up to 3× updates
)
```
2. Resource-Aware Allocation
COpter [8] (Continual Optimization) treats FL as a resource allocation problem:
Goal: Maximize training progress given constraints:
- Server CPU/memory budget
- Aggregate network bandwidth
- Device battery limits
```python
# Resource-aware FL scheduling
client = octomil.OctomilClient(
    training_mode="asynchronous",
    resource_optimization=True,
    constraints={
        "server_cpu": 80,        # 80 core-seconds per second
        "bandwidth": 10_000,     # 10 Gbps aggregate
        "device_battery": 0.05   # Max 5% battery per device
    },
    objective="maximize_convergence_speed"
)
```
3. Efficient LLM Training with Asynchrony
Recent work on LLM optimization applies to FL:
Muon optimizer [9] and variants:
- Drop-Muon [10]: Update less frequently, converge faster
- Error feedback for Muon [11]: Communication efficiency for large models
```python
# Async LLM fine-tuning in FL
client = octomil.OctomilClient(
    model_type="llm",
    training_mode="asynchronous",
    optimizer="muon",
    communication_compression="ef21-muon"  # From Richtárik et al.
)
```
When to Use Async FL
Use asynchronous FL when:
- High device heterogeneity (>10× variance in compute time)
- Unreliable devices (>20% dropout rate)
- Large scale (>1,000 devices)
- Low latency requirements (can't wait for stragglers)
Use synchronous FL when:
- Low heterogeneity (similar devices)
- High reliability (datacenters, edge servers)
- Small scale (<100 devices)
- Simplicity preferred (easier debugging)
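These checklists can be folded into a simple rule of thumb. The function below is an illustrative sketch of such a heuristic, with thresholds taken from the lists above; it is not Octomil's actual `training_mode="auto"` logic:

```python
def choose_training_mode(compute_time_spread, dropout_rate, num_devices):
    """Pick sync vs. async from fleet statistics (illustrative thresholds)."""
    if (compute_time_spread > 10    # >10× spread in device compute times
            or dropout_rate > 0.2   # >20% of devices drop mid-round
            or num_devices > 1000): # large-scale cross-device fleet
        return "asynchronous"
    return "synchronous"

# Cross-device mobile fleet: extreme heterogeneity, flaky devices
print(choose_training_mode(100, 0.3, 2_000_000))  # → asynchronous

# Small cross-silo federation: uniform datacenter hardware
print(choose_training_mode(2, 0.01, 20))          # → synchronous
```

Any one of the three conditions is enough to justify async, since a single bottleneck (stragglers, dropouts, or scale) already breaks synchronous rounds.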
Octomil's Heterogeneity Framework
```python
import octomil

# Octomil automatically adapts to heterogeneity
client = octomil.OctomilClient(
    project_id="production-fl",
    # Auto-detect optimal mode
    training_mode="auto",           # Chooses sync/async based on profiling
    # Async configuration (if needed)
    async_scheduler="shadowheart",  # Optimal for hetero compute + network
    staleness_threshold=5,
    # Device management
    participation_rate=0.1,
    selection_strategy="importance-sampling",
    # Resource constraints
    max_round_time=300,             # 5 min soft deadline
    device_battery_limit=0.05,      # 5% battery max
    # Monitoring
    profiling=True,
    realtime_metrics=True
)

# Train with automatic heterogeneity handling
client.train(
    model=my_model,
    rounds=100
)

# Octomil tracks heterogeneity metrics
stats = client.get_training_stats()
print(f"Avg round time: {stats.avg_round_time}")
print(f"P50/P90/P99 device latency: {stats.latency_percentiles}")
print(f"Dropout rate: {stats.dropout_rate}")
print(f"Staleness distribution: {stats.staleness_histogram}")
```
Real-World Impact
Production results from Octomil deployments:
| Application | Device Count | Heterogeneity | Async Speedup | Setting |
|---|---|---|---|---|
| Mobile keyboard | 2M | Extreme (100×) | 15× | Cross-device |
| Smart home | 50K | High (10×) | 8× | IoT sensors |
| Hospital federation | 50 | Low (2×) | 1.3× | Cross-silo |
Key takeaway: Async FL is essential for cross-device deployments but may be overkill for cross-silo.
Getting Started
```shell
pip install octomil

# Initialize with async support
octomil init async-project --mode asynchronous

# Train with auto-tuned async FL
octomil train \
  --mode asynchronous \
  --scheduler shadowheart \
  --profile-devices
```
See our Advanced FL Concepts guide for more configuration options.