From Research to Production: How Octomil Implements SOTA Federated Learning
The federated learning research landscape is exploding. NeurIPS 2024 alone featured 100+ FL papers. ICML, ICLR, TMLR—every major venue now has substantial FL content.
But there's a chasm between research prototypes and production systems.
Research papers provide algorithms, convergence proofs, and benchmark results on MNIST/CIFAR. Production systems need to handle millions of mobile devices, unreliable networks, Byzantine attackers, GDPR compliance, and 99.9% uptime requirements.
Octomil bridges this gap. This post explains how we translate cutting-edge research into a platform you can pip install and deploy.
The Research-to-Production Gap
What Research Papers Provide
A typical FL paper includes:
- Algorithm: Pseudocode for a new training method
- Theory: Convergence rate analysis (e.g., O(1/√T))
- Experiments: Performance on standard benchmarks
- Code: Often a research prototype (if you're lucky)
Example: Richtárik et al.'s "Ringmaster ASGD"1
- Algorithm: 50 lines of pseudocode
- Theory: First optimal time complexity for async SGD
- Experiments: CIFAR-10, synthetic data
- Code: Research implementation in PyTorch
What Production Systems Need
To deploy FL at scale:
- Device management: Registration, authentication, heartbeats
- Model versioning: Track every model iteration, enable rollback
- Failure handling: Device dropouts, network errors, crashes
- Monitoring: Real-time dashboards, alerts, metrics
- Security: Byzantine robustness, differential privacy, access control
- Infrastructure: Load balancing, auto-scaling, multi-region
- Mobile SDKs: iOS (Swift, CoreML), Android (Kotlin, TFLite)
- API design: Simple, intuitive, well-documented
- Compliance: GDPR, HIPAA, SOC 2
Result: Deploying one research paper's algorithm can require 100× more engineering effort than implementing the algorithm itself.
Octomil's Research-to-Production Pipeline
1. Paper Selection: 80/20 Rule
We track 50+ research groups and evaluate 200+ papers per year. Our filter:
Does this technique solve a problem Octomil users actually face?
We prioritize:
- Communication efficiency: Users pay for bandwidth
- Privacy: Regulations require strong guarantees
- Heterogeneity: Real devices vary wildly
- Compression: Edge devices have limited resources
- Fairness: Production systems serve diverse populations
We don't prioritize:
- Techniques that only work on IID data (unrealistic)
- Algorithms requiring specialized hardware (not accessible)
- Marginal improvements (1% gain not worth complexity)
Example: Richtárik et al.'s communication compression work (EF21, BiCoLoR, LoCoDL) directly addresses user pain points → Implemented.
2. Abstraction: Hide Complexity
Research algorithms have hyperparameters: learning rates, momentum, compression rates, clipping thresholds, staleness bounds.
Octomil's philosophy: Users shouldn't need a PhD to use SOTA techniques.
Our approach:
```python
# Research paper implementation (complex)
optimizer = EFSGD(
    params=model.parameters(),
    lr=0.1,
    compression_operator=TopKCompressor(k=0.01),
    error_feedback=True,
    memory_type="residual",
    nesterov=False,
    momentum=0.9,
    weight_decay=1e-4,
)

# Octomil API (simple)
client = octomil.OctomilClient(
    compression="adaptive"  # Auto-tunes everything
)
```
How we do it:
- Auto-tuning: Profile workload, select optimal hyperparameters
- Sensible defaults: Based on 100+ deployments
- Progressive disclosure: Simple by default, configurable if needed
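As a rough illustration of what an "adaptive" setting can mean under the hood, here is a minimal sketch of bandwidth-aware auto-tuning. The `choose_compression_ratio` helper and its thresholds are hypothetical, not Octomil's actual tuner: it profiles the payload and the link, then picks the largest top-k ratio whose upload fits the round's time budget.

```python
def choose_compression_ratio(model_bytes: int,
                             uplink_bytes_per_s: float,
                             round_budget_s: float,
                             floor: float = 0.01,
                             ceil: float = 1.0) -> float:
    """Pick the largest top-k ratio whose upload fits the round budget.

    Hypothetical heuristic: send as much of the gradient as the device's
    uplink allows, never dropping below `floor` or above `ceil`.
    """
    budget_bytes = uplink_bytes_per_s * round_budget_s
    ratio = budget_bytes / model_bytes  # fraction of entries we can afford
    return max(floor, min(ceil, ratio))

# A 100 MB model on a 1 MB/s uplink with a 10 s budget -> keep ~10% of entries.
print(choose_compression_ratio(100_000_000, 1_000_000, 10.0))
```

A real tuner would also fold in observed round times and convergence signals, but the shape is the same: measure, then pick the hyperparameter for the user.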
3. Systems Integration
Research code runs on a single machine. Production needs distributed infrastructure.
Octomil's architecture:
┌─────────────────────────────────────────────────────────────┐
│ Client SDKs │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Python SDK │ │ Swift SDK │ │ Kotlin SDK │ │
│ │ (Research) │ │ (iOS Prod) │ │ (Android) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Control Plane (FastAPI) │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Research Algorithm Layer │ │
│ │ - FedAvg, Scafflix, Ringmaster, EF21, etc. │ │
│ └────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Production Services Layer │ │
│ │ - Device Manager - Model Registry │ │
│ │ - Round Manager - Metrics Collector │ │
│ │ - A/B Testing - Privacy Accounting │ │
│  └────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Data & Infrastructure │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ S3/MinIO │ │ Redis │ │
│ │ (Metadata) │ │ (Models) │ │ (Cache) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
Key point: Research algorithms are a layer within a much larger system.
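One way to keep that research layer swappable inside the larger system is a plain registry that resolves the string users pass in to an aggregation strategy. This is an illustrative sketch, not Octomil's internals; the names are hypothetical:

```python
from typing import Callable, Dict, List

class Aggregator:
    """Minimal interface the production services layer programs against."""
    def aggregate(self, updates: List[List[float]]) -> List[float]:
        raise NotImplementedError

class FedAvgAggregator(Aggregator):
    def aggregate(self, updates):
        # Plain coordinate-wise mean of client updates.
        n = len(updates)
        return [sum(col) / n for col in zip(*updates)]

AGGREGATORS: Dict[str, Callable[[], Aggregator]] = {
    "fedavg": FedAvgAggregator,
    # "ef21", "scafflix", "ringmaster", ... register here the same way.
}

def make_aggregator(name: str) -> Aggregator:
    return AGGREGATORS[name]()

agg = make_aggregator("fedavg")
print(agg.aggregate([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

The payoff of this design is that device management, metrics, and privacy accounting never need to know which paper's algorithm is running underneath.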
4. Mobile Implementation
Most FL research uses Python/PyTorch. Production needs native mobile.
Octomil's mobile stack:
| Platform | Language | ML Runtime | Key Challenges |
|---|---|---|---|
| iOS | Swift | CoreML | NPU acceleration, background limits |
| Android | Kotlin | TFLite | Fragmentation, battery management |
| Web | TypeScript | ONNX Runtime | Limited compute, privacy in browser |
Example: Implementing EF21 (error feedback compression) on iOS:
- Research code (Python, 50 lines):
```python
import torch

def ef21_compress(grad, error, k=0.1):
    # Error feedback: fold the residual from the last round into this one
    compensated = grad + error
    # Keep only the top-k fraction of entries by magnitude
    mask = torch.abs(compensated) > torch.quantile(torch.abs(compensated), 1 - k)
    compressed = compensated * mask
    # Whatever we dropped becomes next round's error
    error = compensated - compressed
    return compressed, error
```
- Production code (Swift, 200 lines):
```swift
class EF21Compressor {
    private var errorBuffer: MLMultiArray
    private let compressionRatio: Float
    private let accelerator: MetalCompressor  // GPU acceleration

    func compress(gradient: MLMultiArray) -> CompressedUpdate {
        // 1. Add error compensation (vectorized via Accelerate framework)
        vDSP.add(gradient, errorBuffer, result: &compensated)
        // 2. Compute quantile (parallel on NPU if available)
        let threshold = accelerator.quantile(compensated, q: 1 - compressionRatio)
        // 3. Sparsify (Metal shader for GPU acceleration)
        let compressed = accelerator.threshold(compensated, threshold: threshold)
        // 4. Update error buffer
        vDSP.subtract(compensated, compressed, result: &errorBuffer)
        // 5. Serialize for transmission (Protocol Buffers)
        return CompressedUpdate(
            indices: compressed.indices,
            values: compressed.values,
            metadata: metadata
        )
    }
}
```
Additional complexity:
- Battery management: Pause training if battery < 20%
- Memory constraints: Stream gradients to avoid OOM
- Background limits: iOS allows ~30s in background
- Hardware acceleration: Use Apple Neural Engine when available
5. Testing at Scale
Research papers test on MNIST/CIFAR with 10-100 simulated clients.
Octomil testing:
- Unit tests: Algorithm correctness (does EF21 match paper?)
- Integration tests: End-to-end flows (device registration → training → aggregation)
- Scale tests: 10K+ simulated devices
- Production tests: Shadow mode (run new algorithm alongside stable, compare)
Example test failure that caught a bug:
```python
# Test: EF21 convergence with 10K heterogeneous devices
def test_ef21_convergence():
    devices = [
        SimulatedDevice(compute_speed=speed, dropout_rate=dropout)
        for speed, dropout in generate_heterogeneous_devices(10_000)
    ]
    model = train_federated(
        devices=devices,
        algorithm="ef21",
        rounds=100,
    )
    assert model.accuracy > 0.95, "EF21 failed to converge with heterogeneity"

# This test FAILED in v0.4.2
# Root cause: Error buffer not properly synchronized across rounds
# Fix: Add explicit error state management
```
6. Monitoring & Observability
Research papers report final accuracy. Production needs real-time monitoring.
Octomil's dashboard tracks:
- Round progress: Devices participating, updates received, time elapsed
- Device health: CPU/GPU utilization, memory usage, battery impact
- Model metrics: Loss, accuracy, gradient norms, weight distributions
- Communication: Bytes sent/received, compression ratios, latencies
- Privacy: ε spent, δ probability, clipping thresholds
- Failures: Dropped devices, aggregation errors, Byzantine detections
Example alert:
```
ALERT: Round 47 staleness exceeds threshold
- 23% of devices have staleness > 5 (threshold)
- Recommendation: Increase staleness_threshold or reduce device count
- Affected algorithm: Ringmaster ASGD
```
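An alert like that reduces to a simple check over per-device staleness counters. A minimal sketch, with hypothetical thresholds and message format (not Octomil's actual alerting rules):

```python
def staleness_alert(staleness_by_device: dict,
                    staleness_threshold: int = 5,
                    max_stale_fraction: float = 0.2):
    """Return an alert string if too many devices exceed the staleness bound."""
    stale = [d for d, s in staleness_by_device.items() if s > staleness_threshold]
    frac = len(stale) / len(staleness_by_device)
    if frac > max_stale_fraction:
        return (f"ALERT: {frac:.0%} of devices have staleness > "
                f"{staleness_threshold}")
    return None

# 3 of 10 devices are stale -> 30% exceeds the 20% budget, so we alert.
print(staleness_alert({f"d{i}": (9 if i < 3 else 1) for i in range(10)}))
```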
Research Papers Implemented in Octomil
Communication Efficiency
| Paper | Technique | Octomil Feature |
|---|---|---|
| EF21 (Richtárik)2 | Error feedback compression | compression="ef21" |
| BiCoLoR (Richtárik)3 | Bidirectional compression + local training | compression="bicolor" |
| LoCoDL (Richtárik)4 | Local training + compression | compression="locodl" |
| Scafflix (Smith)5 | Personalization + local training | personalization="scafflix" |
| FedComLoc (Richtárik)6 | Sparse + quantized training | compression="fedcomloc" |
Privacy & Security
| Paper | Technique | Octomil Feature |
|---|---|---|
| Fed-α-NormEC (Richtárik)7 | Practical DP-FL | privacy="differential" |
| Clip21 (Richtárik)8 | Error feedback + DP | privacy="clip21" |
| Bounded Group Loss (Smith)9 | Fairness guarantees | fairness="bounded-group-loss" |
| Private Multi-Task (Smith)10 | Task-specific privacy | privacy="multi-task" |
Asynchronous & Heterogeneous Systems
| Paper | Technique | Octomil Feature |
|---|---|---|
| Ringmaster ASGD (Richtárik)1 | Optimal async SGD | scheduler="ringmaster" |
| Shadowheart SGD (Richtárik)11 | Compute + network hetero | scheduler="shadowheart" |
| MindFlayer SGD (Richtárik)12 | Stochastic completion times | scheduler="mindflayer" |
| Cohort Squeeze (Richtárik)13 | Multi-round cohorts | cohort_squeeze=True |
Model Compression
| Paper | Technique | Octomil Feature |
|---|---|---|
| PV-Tuning (Richtárik)14 | Extreme quantization | quantization="pv-tuning" |
| FedP3 (Richtárik)15 | Personalized pruning | pruning="fedp3" |
| RAC-LoRA (Richtárik)16 | Theoretical LoRA framework | adaptation="rac-lora" |
| Federated LoRA (Smith)17 | Sparse LoRA communication | adaptation="fedlora-sparse" |
| MicroAdam (Richtárik)18 | Low-memory optimizer | optimizer="microadam" |
Case Study: Implementing Shadowheart SGD
Let's walk through implementing one research paper end-to-end.
Research Paper (Richtárik et al., NeurIPS 2024)
Title: Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity
Key contribution: First async algorithm optimal for both compute AND network heterogeneity.
Algorithm (simplified pseudocode):
```
for round t = 1, 2, ..., T:
    for each device i in active_devices:
        # Predict device latency
        latency_i = predict_latency(device_i, model_version_t)
        # Adaptive scheduling
        if latency_i < threshold:
            assign_task(device_i, model_version_t)
    # Asynchronous aggregation
    while time < round_deadline:
        if update_available():
            model_t = aggregate(model_t, received_updates)
    broadcast(model_t)
```
Octomil Implementation (Staged Rollout)
Phase 1: Core Algorithm (Week 1-2)
```python
# server/aggregators/shadowheart.py
class ShadowheartAggregator(AsyncAggregator):
    def __init__(self, staleness_threshold=5):
        self.staleness_threshold = staleness_threshold
        self.latency_predictor = LatencyPredictor()
        self.model_versions = ModelVersionManager()

    def select_devices(self, available_devices, round_deadline):
        # Predict latencies
        predictions = {
            device: self.latency_predictor.predict(device)
            for device in available_devices
        }
        # Adaptive selection (Shadowheart scheduling logic)
        selected = [
            device for device, latency in predictions.items()
            if latency < round_deadline * 0.8  # 80% buffer
        ]
        return selected

    def aggregate(self, updates):
        # Version-aware aggregation
        aggregated = torch.zeros_like(self.model)
        for update in updates:
            staleness = self.model_versions.current - update.version
            if staleness <= self.staleness_threshold:
                weight = self._compute_weight(staleness)
                aggregated += weight * update.gradients
        return aggregated

    def _compute_weight(self, staleness):
        # Shadowheart weighting scheme
        return 1.0 / (1.0 + staleness)
```
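As a quick sanity check on that `1/(1+s)` weighting: each extra round of staleness shrinks an update's influence hyperbolically, so a fresh update counts exactly twice as much as a one-round-old one. The normalization step below is added for illustration and is not part of the snippet above:

```python
def staleness_weights(stalenesses):
    """1/(1+s) weights, normalized here so they sum to 1."""
    raw = [1.0 / (1.0 + s) for s in stalenesses]
    total = sum(raw)
    return [w / total for w in raw]

# s=0 gets twice the weight of s=1, four times the weight of s=3.
print(staleness_weights([0, 1, 3]))
```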
Phase 2: Latency Prediction (Week 3)
```python
# server/predictors/latency_predictor.py
from sklearn.ensemble import RandomForestRegressor

class LatencyPredictor:
    def __init__(self):
        self.history = {}  # device_id -> [latency samples]
        self.model = RandomForestRegressor()  # ML-based prediction

    def predict(self, device):
        if device.id not in self.history:
            return self._default_latency(device)
        # Features: device hardware, network, time of day, etc.
        features = self._extract_features(device)
        return self.model.predict([features])[0]

    def update(self, device, actual_latency):
        # Online learning: fold each observed latency back into the predictor
        self.history.setdefault(device.id, []).append(actual_latency)
        if len(self.history[device.id]) > 100:
            self._retrain()
```
Phase 3: Mobile SDK (Week 4-6)
```swift
// sdks/ios/Octomil/Aggregators/ShadowheartClient.swift
class ShadowheartClient: AsyncClient {
    func train(model: MLModel, data: Data) async throws -> Update {
        let startTime = Date()
        // Local training
        let gradients = try await trainLocally(model: model, data: data)
        // Measure latency components
        let computeTime = Date().timeIntervalSince(startTime)
        let networkTime = try await measureNetworkLatency()
        // Report latency for predictor
        try await reportLatency(
            compute: computeTime,
            network: networkTime
        )
        // Upload update
        return try await uploadUpdate(gradients: gradients)
    }
}
```
Phase 4: Testing (Week 7)
```python
# tests/test_shadowheart.py
def test_shadowheart_heterogeneous_devices():
    # Simulate 1000 devices with 100× compute variance, 50× network variance
    devices = generate_heterogeneous_devices(
        count=1000,
        compute_variance=100,
        network_variance=50,
    )
    # Train with Shadowheart
    model, metrics = train_federated(
        devices=devices,
        algorithm="shadowheart",
        rounds=50,
    )
    # Verify optimal time complexity
    expected_rounds = compute_optimal_rounds(devices)
    assert metrics.rounds_to_convergence <= expected_rounds * 1.1  # 10% slack
```
Phase 5: Production Rollout (Week 8)
```python
# Shadow mode: Run alongside existing algorithm
client = octomil.OctomilClient(
    project_id="keyboard-prediction",
    training_mode="shadow",          # Don't use Shadowheart output yet
    shadow_algorithm="shadowheart",  # Compare against current
    rollout_percentage=1,            # All devices shadow test
)

# After 1 week of shadow testing, gradual rollout
client.set_training_mode("production")
client.set_algorithm("shadowheart")
client.gradual_rollout(start=0.01, end=1.0, duration_days=14)
```
Total: 8 weeks from research paper to production deployment.
Challenges We've Faced
1. Theory vs. Practice Gaps
Example: Many papers assume devices complete training deterministically. Reality: devices drop out randomly.
Solution: Extend algorithms with dropout handling (checkpointing, partial aggregation).
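A minimal sketch of dropout-tolerant partial aggregation (the quorum rule and names are illustrative, not Octomil's exact policy): aggregate whatever arrives before the deadline, weighted by sample count, and only commit the round if enough devices reported.

```python
def partial_aggregate(updates, min_quorum=0.5, expected=None):
    """FedAvg over the updates that actually arrived.

    `updates` is a list of (num_samples, delta_vector) pairs from devices
    that survived the round; `expected` is how many were scheduled.
    Returns None (round skipped) if fewer than `min_quorum` reported.
    """
    expected = expected or len(updates)
    if len(updates) < min_quorum * expected:
        return None  # too many dropouts: keep the previous model
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    # Sample-weighted mean, computed only over the survivors
    return [sum(n * d[i] for n, d in updates) / total for i in range(dim)]

# 2 of 4 scheduled devices dropped out; the other two still form a round.
print(partial_aggregate([(10, [1.0, 1.0]), (30, [3.0, 3.0])], expected=4))
```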
2. Hyperparameter Sensitivity
Example: EF21 requires tuning compression rate k per workload.
Solution: Auto-tuning via Bayesian optimization on initial rounds.
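The tuning loop can be sketched as a small search over candidate compression rates during the initial rounds. Bayesian optimization is swapped for brute-force probing here, and the `evaluate_round` callback is hypothetical:

```python
def tune_compression(evaluate_round, candidates=(0.01, 0.05, 0.1, 0.5)):
    """Run one probe round per candidate k and keep the best-scoring one.

    `evaluate_round(k)` runs a short federated round at compression rate k
    and returns a score (e.g. loss reduction per byte sent); higher is
    better. A real tuner would use Bayesian optimization over a continuous
    range instead of exhaustive probes.
    """
    scores = {k: evaluate_round(k) for k in candidates}
    return max(scores, key=scores.get)

# Toy objective that peaks at k = 0.1.
print(tune_compression(lambda k: -(k - 0.1) ** 2))
```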
3. Mobile Constraints
Example: iOS background limits break long-running training.
Solution: Incremental training with state persistence, resume on app foreground.
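Incremental training with state persistence boils down to a checkpoint/resume loop: save progress after every slice of work so an OS interruption only loses the current slice. A sketch, with an illustrative JSON file and field names:

```python
import json
import os

CKPT = "training_state.json"

def load_state():
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "loss": None}

def train_increment(state, steps=10):
    # Stand-in for a real local-training slice on the device.
    state["step"] += steps
    state["loss"] = 1.0 / state["step"]
    with open(CKPT, "w") as f:
        json.dump(state, f)  # persist before the OS can suspend us
    return state

state = train_increment(load_state())  # first foreground window
state = train_increment(load_state())  # resumed after an interruption
print(state["step"])  # 20
os.remove(CKPT)
```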
4. Reproducibility
Example: Research code often lacks seed control, making results non-deterministic.
Solution: Strict seeding, deterministic operations, CI tests verify reproducibility.
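The seeding discipline can be sketched as one helper called at the start of every run. Only the stdlib part is executable here; the `numpy`/`torch` lines a real deployment would add are shown as comments:

```python
import random

def seed_everything(seed: int) -> None:
    """Make runs with the same seed produce the same sampling decisions."""
    random.seed(seed)
    # A full stack would also pin:
    #   np.random.seed(seed); torch.manual_seed(seed)
    #   torch.use_deterministic_algorithms(True)

seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
assert first == [random.random() for _ in range(3)]  # bit-identical replay
print("reproducible")
```

A CI job can then assert that two seeded runs yield identical metrics, which is exactly the reproducibility check mentioned above.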
What We Learned
- Start simple: Implement basic version first, optimize later
- Test at scale: MNIST results don't transfer to 10K devices
- Monitor everything: Instrumentation is not optional
- Progressive rollout: Shadow mode → 1% → 10% → 100%
- User feedback: Researchers care about convergence, engineers care about latency
Octomil's Research Partnerships
We actively collaborate with research groups:
- Richtárik Lab (KAUST): Communication efficiency, asynchronous optimization
- Smith Lab (CMU): Fairness, privacy, personalization
- Yang Lab (Texas A&M): Robust optimization, AUC maximization
How we work together:
- Researchers get production feedback (does it work at scale?)
- We get early access to techniques (implement before publication)
- Joint papers on systems challenges (bridging theory and practice)
What's Next
Research areas we're tracking:
- Test-time compute for FL: Scaling inference during federated training
- LLM unlearning at scale: Efficient removal of device contributions
- Federated RL: Extending FL to reinforcement learning
- Cross-modal FL: Training across vision, language, audio simultaneously
Production features in development:
- Automatic algorithm selection (ML meta-learning for best FL algorithm)
- One-click migration from centralized to federated training
- Multi-cloud deployment (AWS, Azure, GCP simultaneously)
Join Us
For researchers: Share your papers, we'll help productionize them. For practitioners: Try Octomil, tell us what works (and what doesn't).
References
Foundational Papers
The paper that started it all:
Core optimization methods:
Personalization:
Survey: