From Research to Production: How Octomil Implements SOTA Federated Learning

The federated learning research landscape is exploding. NeurIPS 2024 alone featured 100+ FL papers. ICML, ICLR, TMLR—every major venue now has substantial FL content.

But there's a chasm between research prototypes and production systems.

Research papers provide algorithms, convergence proofs, and benchmark results on MNIST/CIFAR. Production systems need to handle millions of mobile devices, unreliable networks, Byzantine attackers, GDPR compliance, and 99.9% uptime requirements.

Octomil bridges this gap. This post explains how we translate cutting-edge research into a platform you can pip install and deploy.

The Research-to-Production Gap

What Research Papers Provide

A typical FL paper includes:

  • Algorithm: Pseudocode for a new training method
  • Theory: Convergence rate analysis (e.g., O(1/√T))
  • Experiments: Performance on standard benchmarks
  • Code: Often a research prototype (if you're lucky)

Example: Richtárik et al.'s "Ringmaster ASGD" [1]

  • Algorithm: 50 lines of pseudocode
  • Theory: First optimal time complexity for async SGD
  • Experiments: CIFAR-10, synthetic data
  • Code: Research implementation in PyTorch

What Production Systems Need

To deploy FL at scale:

  • Device management: Registration, authentication, heartbeats
  • Model versioning: Track every model iteration, enable rollback
  • Failure handling: Device dropouts, network errors, crashes
  • Monitoring: Real-time dashboards, alerts, metrics
  • Security: Byzantine robustness, differential privacy, access control
  • Infrastructure: Load balancing, auto-scaling, multi-region
  • Mobile SDKs: iOS (Swift, CoreML), Android (Kotlin, TFLite)
  • API design: Simple, intuitive, well-documented
  • Compliance: GDPR, HIPAA, SOC 2

Result: Deploying a single research paper's algorithm can require 100× more engineering effort than implementing the algorithm itself.

Octomil's Research-to-Production Pipeline

1. Paper Selection: 80/20 Rule

We track 50+ research groups and evaluate 200+ papers per year. Our filter:

Does this technique solve a problem Octomil users actually face?

We prioritize:

  • Communication efficiency: Users pay for bandwidth
  • Privacy: Regulations require strong guarantees
  • Heterogeneity: Real devices vary wildly
  • Compression: Edge devices have limited resources
  • Fairness: Production systems serve diverse populations

We don't prioritize:

  • Techniques that only work on IID data (unrealistic)
  • Algorithms requiring specialized hardware (not accessible)
  • Marginal improvements (1% gain not worth complexity)

Example: Richtárik et al.'s communication compression work (EF21, BiCoLoR, LoCoDL) directly addresses user pain points → Implemented.

2. Abstraction: Hide Complexity

Research algorithms have hyperparameters: learning rates, momentum, compression rates, clipping thresholds, staleness bounds.

Octomil's philosophy: Users shouldn't need a PhD to use SOTA techniques.

Our approach:

```python
# Research paper implementation (complex)
optimizer = EFSGD(
    params=model.parameters(),
    lr=0.1,
    compression_operator=TopKCompressor(k=0.01),
    error_feedback=True,
    memory_type="residual",
    nesterov=False,
    momentum=0.9,
    weight_decay=1e-4,
)

# Octomil API (simple)
client = octomil.OctomilClient(
    compression="adaptive"  # Auto-tunes everything
)
```

How we do it:

  • Auto-tuning: Profile workload, select optimal hyperparameters
  • Sensible defaults: Based on 100+ deployments
  • Progressive disclosure: Simple by default, configurable if needed
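Concretely, "progressive disclosure" means a preset string expands into a full hyperparameter bundle that power users can still override knob by knob. A minimal sketch of the pattern (the preset names and values here are illustrative, not Octomil's actual defaults):

```python
# Hypothetical presets: a single string expands into a full config bundle
DEFAULTS = {
    "adaptive": {"operator": "topk", "k": 0.01, "error_feedback": True},
    "ef21": {"operator": "topk", "k": 0.05, "error_feedback": True},
    "none": {"operator": "identity", "k": 1.0, "error_feedback": False},
}

def resolve_compression(setting, **overrides):
    """Expand a preset name into concrete settings; explicit overrides win."""
    if isinstance(setting, dict):      # expert path: full config passed directly
        return dict(setting)
    config = dict(DEFAULTS[setting])   # simple path: named preset
    config.update(overrides)           # progressive disclosure: override any knob
    return config
```

Passing a dict instead of a preset name is the expert path; everyone else gets the tuned defaults.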

3. Systems Integration

Research code runs on a single machine. Production needs distributed infrastructure.

Octomil's architecture:

```text
┌─────────────────────────────────────────────────────────────┐
│ Client SDKs │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Python SDK │ │ Swift SDK │ │ Kotlin SDK │ │
│ │ (Research) │ │ (iOS Prod) │ │ (Android) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Control Plane (FastAPI) │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Research Algorithm Layer │ │
│ │ - FedAvg, Scafflix, Ringmaster, EF21, etc. │ │
│ └────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Production Services Layer │ │
│ │ - Device Manager - Model Registry │ │
│ │ - Round Manager - Metrics Collector │ │
│ │ - A/B Testing - Privacy Accounting │ │
│ └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Data & Infrastructure │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ S3/MinIO │ │ Redis │ │
│ │ (Metadata) │ │ (Models) │ │ (Cache) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
```

Key point: Research algorithms are a layer within a much larger system.

4. Mobile Implementation

Most FL research uses Python/PyTorch. Production needs native mobile.

Octomil's mobile stack:

| Platform | Language | ML Runtime | Key Challenges |
| --- | --- | --- | --- |
| iOS | Swift | CoreML | NPU acceleration, background limits |
| Android | Kotlin | TFLite | Fragmentation, battery management |
| Web | TypeScript | ONNX Runtime | Limited compute, privacy in browser |

Example: Implementing EF21 (error feedback compression) on iOS:

  1. Research code (Python, 50 lines):

```python
def ef21_compress(grad, error, k=0.1):
    compensated = grad + error
    mask = torch.abs(compensated) > torch.quantile(torch.abs(compensated), 1 - k)
    compressed = compensated * mask
    error = compensated - compressed
    return compressed, error
```
  2. Production code (Swift, 200 lines):

```swift
class EF21Compressor {
    private var errorBuffer: MLMultiArray
    private let compressionRatio: Float
    private let accelerator: MetalCompressor // GPU acceleration

    func compress(gradient: MLMultiArray) -> CompressedUpdate {
        // 1. Add error compensation (vectorized via Accelerate framework)
        vDSP.add(gradient, errorBuffer, result: &compensated)

        // 2. Compute quantile (parallel on NPU if available)
        let threshold = accelerator.quantile(compensated, q: 1 - compressionRatio)

        // 3. Sparsify (Metal shader for GPU acceleration)
        let compressed = accelerator.threshold(compensated, threshold: threshold)

        // 4. Update error buffer
        vDSP.subtract(compensated, compressed, result: &errorBuffer)

        // 5. Serialize for transmission (Protocol Buffers)
        return CompressedUpdate(
            indices: compressed.indices,
            values: compressed.values,
            metadata: metadata
        )
    }
}
```

Additional complexity:

  • Battery management: Pause training if battery < 20%
  • Memory constraints: Stream gradients to avoid OOM
  • Background limits: iOS allows ~30s in background
  • Hardware acceleration: Use Apple Neural Engine when available
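Stripped of PyTorch and mobile concerns, the core of EF21-style compression fits in a few lines of NumPy, and its key invariant (nothing is lost, only deferred: compressed plus error always equals the compensated gradient) can be checked directly. A sketch, not Octomil's implementation:

```python
import numpy as np

def ef21_compress(grad, error, k=0.1):
    # Compensate with the residual carried over from the previous round
    compensated = grad + error
    # Keep the top-k fraction of entries by magnitude
    keep = max(1, int(k * compensated.size))
    threshold = np.sort(np.abs(compensated))[-keep]
    compressed = np.where(np.abs(compensated) >= threshold, compensated, 0.0)
    # New residual: everything the compressor dropped this round
    return compressed, compensated - compressed

rng = np.random.default_rng(0)
grad = rng.normal(size=1_000)
compressed, error = ef21_compress(grad, np.zeros_like(grad), k=0.1)

# Invariant: nothing is lost, only deferred to the next round
assert np.allclose(compressed + error, grad)
```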

5. Testing at Scale

Research papers test on MNIST/CIFAR with 10-100 simulated clients.

Octomil testing:

  • Unit tests: Algorithm correctness (does EF21 match paper?)
  • Integration tests: End-to-end flows (device registration → training → aggregation)
  • Scale tests: 10K+ simulated devices
  • Production tests: Shadow mode (run new algorithm alongside stable, compare)
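The test snippets in this post call a `generate_heterogeneous_devices` helper whose definition isn't shown. One plausible sketch draws compute speeds from a long-tailed log-normal distribution and dropout rates from a Beta distribution, which matches the device populations described above (the distributions and parameters are illustrative assumptions, not Octomil's simulator):

```python
import random

def generate_heterogeneous_devices(count, seed=42):
    """Yield (compute_speed, dropout_rate) pairs: a long tail of slow and
    fast devices, most of which rarely drop out."""
    rng = random.Random(seed)
    devices = []
    for _ in range(count):
        speed = rng.lognormvariate(0.0, 1.0)            # long-tailed speeds
        dropout = min(0.5, rng.betavariate(1.2, 10.0))  # mostly reliable
        devices.append((speed, dropout))
    return devices

devices = generate_heterogeneous_devices(10_000)
```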

Example test failure that caught a bug:

```python
# Test: EF21 convergence with 10K heterogeneous devices
def test_ef21_convergence():
    devices = [
        SimulatedDevice(compute_speed=speed, dropout_rate=dropout)
        for speed, dropout in generate_heterogeneous_devices(10_000)
    ]

    model = train_federated(
        devices=devices,
        algorithm="ef21",
        rounds=100,
    )

    assert model.accuracy > 0.95, "EF21 failed to converge with heterogeneity"

# This test FAILED in v0.4.2
# Root cause: Error buffer not properly synchronized across rounds
# Fix: Add explicit error state management
```

6. Monitoring & Observability

Research papers report final accuracy. Production needs real-time monitoring.

Octomil's dashboard tracks:

  • Round progress: Devices participating, updates received, time elapsed
  • Device health: CPU/GPU utilization, memory usage, battery impact
  • Model metrics: Loss, accuracy, gradient norms, weight distributions
  • Communication: Bytes sent/received, compression ratios, latencies
  • Privacy: ε spent, δ probability, clipping thresholds
  • Failures: Dropped devices, aggregation errors, Byzantine detections

Example alert:

```text
ALERT: Round 47 staleness exceeds threshold
- 23% of devices have staleness > 5 (threshold)
- Recommendation: Increase staleness_threshold or reduce device count
- Affected algorithm: Ringmaster ASGD
```
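The check behind an alert like this is simple to sketch: compare the fraction of stale devices against a threshold. The function name and thresholds below are illustrative, not Octomil's monitoring code:

```python
def staleness_alert(device_versions, current_version,
                    staleness_threshold=5, fraction_threshold=0.2):
    """Return an alert string if too many devices hold stale models.

    device_versions: the model version each active device last received.
    """
    stale = [v for v in device_versions
             if current_version - v > staleness_threshold]
    fraction = len(stale) / len(device_versions)
    if fraction > fraction_threshold:
        return (f"ALERT: {fraction:.0%} of devices exceed staleness "
                f"threshold {staleness_threshold}")
    return None

# 23 of 100 devices are 6+ versions behind, so the alert fires
versions = [47] * 77 + [41] * 23
alert = staleness_alert(versions, current_version=47)
```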

Research Papers Implemented in Octomil

Communication Efficiency

| Paper | Technique | Octomil Feature |
| --- | --- | --- |
| EF21 (Richtárik) [2] | Error feedback compression | `compression="ef21"` |
| BiCoLoR (Richtárik) [3] | Bidirectional compression + local training | `compression="bicolor"` |
| LoCoDL (Richtárik) [4] | Local training + compression | `compression="locodl"` |
| Scafflix (Richtárik) [5] | Personalization + local training | `personalization="scafflix"` |
| FedComLoc (Richtárik) [6] | Sparse + quantized training | `compression="fedcomloc"` |

Privacy & Security

| Paper | Technique | Octomil Feature |
| --- | --- | --- |
| Fed-α-NormEC (Richtárik) [7] | Practical DP-FL | `privacy="differential"` |
| Clip21 (Richtárik) [8] | Error feedback + DP | `privacy="clip21"` |
| Bounded Group Loss (Smith) [9] | Fairness guarantees | `fairness="bounded-group-loss"` |
| Private Multi-Task (Smith) [10] | Task-specific privacy | `privacy="multi-task"` |

Asynchronous & Heterogeneous Systems

| Paper | Technique | Octomil Feature |
| --- | --- | --- |
| Ringmaster ASGD (Richtárik) [1] | Optimal async SGD | `scheduler="ringmaster"` |
| Shadowheart SGD (Richtárik) [11] | Compute + network heterogeneity | `scheduler="shadowheart"` |
| MindFlayer SGD (Richtárik) [12] | Stochastic completion times | `scheduler="mindflayer"` |
| Cohort Squeeze (Richtárik) [13] | Multi-round cohorts | `cohort_squeeze=True` |

Model Compression

| Paper | Technique | Octomil Feature |
| --- | --- | --- |
| PV-Tuning (Richtárik) [14] | Extreme quantization | `quantization="pv-tuning"` |
| FedP3 (Richtárik) [15] | Personalized pruning | `pruning="fedp3"` |
| RAC-LoRA (Richtárik) [16] | Theoretical LoRA framework | `adaptation="rac-lora"` |
| Federated LoRA (Smith) [17] | Sparse LoRA communication | `adaptation="fedlora-sparse"` |
| MicroAdam (Richtárik) [18] | Low-memory optimizer | `optimizer="microadam"` |

Case Study: Implementing Shadowheart SGD

Let's walk through implementing one research paper end-to-end.

Research Paper (Richtárik et al., NeurIPS 2024)

Title: Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity

Key contribution: First async algorithm optimal for both compute AND network heterogeneity.

Algorithm (simplified pseudocode):

for round t = 1, 2, ..., T:
for each device i in active_devices:
# Predict device latency
latency_i = predict_latency(device_i, model_version_t)

# Adaptive scheduling
if latency_i < threshold:
assign_task(device_i, model_version_t)

# Asynchronous aggregation
while time < round_deadline:
if update_available():
model_t = aggregate(model_t, received_updates)

broadcast(model_t)

Octomil Implementation (Staged Rollout)

Phase 1: Core Algorithm (Week 1-2)

```python
# server/aggregators/shadowheart.py
class ShadowheartAggregator(AsyncAggregator):
    def __init__(self, staleness_threshold=5):
        self.staleness_threshold = staleness_threshold
        self.latency_predictor = LatencyPredictor()
        self.model_versions = ModelVersionManager()

    def select_devices(self, available_devices, round_deadline):
        # Predict latencies
        predictions = {
            device: self.latency_predictor.predict(device)
            for device in available_devices
        }

        # Adaptive selection (Shadowheart scheduling logic)
        selected = [
            device for device, latency in predictions.items()
            if latency < round_deadline * 0.8  # keep a 20% buffer
        ]

        return selected

    def aggregate(self, updates):
        # Version-aware aggregation
        aggregated = torch.zeros_like(self.model)

        for update in updates:
            staleness = self.model_versions.current - update.version
            if staleness <= self.staleness_threshold:
                weight = self._compute_weight(staleness)
                aggregated += weight * update.gradients

        return aggregated

    def _compute_weight(self, staleness):
        # Shadowheart weighting scheme
        return 1.0 / (1.0 + staleness)
```
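The staleness weighting in `_compute_weight` can be exercised on its own. A self-contained NumPy sketch of version-aware aggregation, simplified from the aggregator above to plain vectors:

```python
import numpy as np

def compute_weight(staleness):
    # Harmonic down-weighting, as in the aggregator's _compute_weight
    return 1.0 / (1.0 + staleness)

def aggregate(updates, current_version, staleness_threshold=5):
    """updates: list of (gradient_vector, model_version) pairs."""
    total = np.zeros_like(updates[0][0])
    for grad, version in updates:
        staleness = current_version - version
        if staleness <= staleness_threshold:  # discard updates that are too stale
            total += compute_weight(staleness) * grad
    return total

updates = [
    (np.ones(3), 10),  # fresh: weight 1
    (np.ones(3), 8),   # staleness 2: weight 1/3
    (np.ones(3), 2),   # staleness 8: discarded
]
result = aggregate(updates, current_version=10)
```

With these inputs the fresh and mildly stale updates contribute, while the eight-rounds-stale one is dropped entirely.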

Phase 2: Latency Prediction (Week 3)

```python
# server/predictors/latency_predictor.py
class LatencyPredictor:
    def __init__(self):
        self.history = {}  # device_id -> [latency samples]
        self.model = RandomForestRegressor()  # ML-based prediction

    def predict(self, device):
        if device.id not in self.history:
            return self._default_latency(device)

        # Features: device hardware, network, time of day, etc.
        features = self._extract_features(device)
        return self.model.predict([features])[0]

    def update(self, device, actual_latency):
        # Online learning: update predictor with the observed latency
        self.history.setdefault(device.id, []).append(actual_latency)
        if len(self.history[device.id]) > 100:
            self._retrain()
```

Phase 3: Mobile SDK (Week 4-6)

```swift
// sdks/ios/Octomil/Aggregators/ShadowheartClient.swift
class ShadowheartClient: AsyncClient {
    func train(model: MLModel, data: Data) async throws -> Update {
        let startTime = Date()

        // Local training
        let gradients = try await trainLocally(model: model, data: data)

        // Measure latency components
        let computeTime = Date().timeIntervalSince(startTime)
        let networkTime = try await measureNetworkLatency()

        // Report latency for predictor
        try await reportLatency(
            compute: computeTime,
            network: networkTime
        )

        // Upload update
        return try await uploadUpdate(gradients: gradients)
    }
}
```

Phase 4: Testing (Week 7)

```python
# tests/test_shadowheart.py
def test_shadowheart_heterogeneous_devices():
    # Simulate 1000 devices with 100× compute variance, 50× network variance
    devices = generate_heterogeneous_devices(
        count=1000,
        compute_variance=100,
        network_variance=50,
    )

    # Train with Shadowheart
    model, metrics = train_federated(
        devices=devices,
        algorithm="shadowheart",
        rounds=50,
    )

    # Verify near-optimal time complexity
    expected_rounds = compute_optimal_rounds(devices)
    assert metrics.rounds_to_convergence <= expected_rounds * 1.1  # 10% slack
```

Phase 5: Production Rollout (Week 8)

```python
# Shadow mode: run the new algorithm alongside the existing one
client = octomil.OctomilClient(
    project_id="keyboard-prediction",
    training_mode="shadow",          # Don't use Shadowheart output yet
    shadow_algorithm="shadowheart",  # Compare against the current algorithm
    rollout_percentage=1,            # All devices shadow-test
)

# After 1 week of shadow testing, gradual rollout
client.set_training_mode("production")
client.set_algorithm("shadowheart")
client.gradual_rollout(start=0.01, end=1.0, duration_days=14)
```

Total: 8 weeks from research paper to production deployment.

Challenges We've Faced

1. Theory vs. Practice Gaps

Example: Many papers assume devices complete training deterministically. Reality: devices drop out randomly.

Solution: Extend algorithms with dropout handling (checkpointing, partial aggregation).
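One simple form of partial aggregation: average whatever updates actually arrived and report the participation rate so the caller can react, for example by dampening the learning rate. A sketch under those assumptions, not the exact scheme we ship:

```python
import numpy as np

def partial_aggregate(expected, received):
    """Average whatever gradient updates actually arrived.

    expected: number of devices asked to train this round
    received: list of gradient vectors that actually came back
    Returns (mean_update, participation_rate), or None if nobody reported.
    """
    if not received:
        return None                    # skip the round entirely
    mean = np.mean(received, axis=0)   # average over survivors
    return mean, len(received) / expected

# 10 devices asked, only 2 reported back
update, rate = partial_aggregate(expected=10, received=[np.ones(4), 3 * np.ones(4)])
```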

2. Hyperparameter Sensitivity

Example: EF21 requires tuning compression rate k per workload.

Solution: Auto-tuning via Bayesian optimization on initial rounds.
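The production tuner uses Bayesian optimization, but the idea fits in a few lines with a plain grid search as a stand-in: probe candidate compression rates on pilot rounds and keep the sparsest one whose loss stays within a tolerance of the uncompressed baseline. A hypothetical sketch; `pilot_round` is any callable that trains briefly and returns a loss:

```python
def autotune_k(pilot_round, candidates=(0.01, 0.05, 0.1, 1.0), tolerance=0.05):
    """Pick the sparsest compression rate whose pilot loss stays within
    `tolerance` of the uncompressed baseline."""
    baseline = pilot_round(k=1.0)    # uncompressed reference loss
    for k in sorted(candidates):     # try the sparsest rates first
        if pilot_round(k=k) <= baseline * (1 + tolerance):
            return k
    return 1.0                       # fall back to no compression

# Toy pilot: heavier compression hurts the loss a little
toy_loss = lambda k: 1.0 + 0.03 * (1 - k)
chosen = autotune_k(toy_loss)
```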

3. Mobile Constraints

Example: iOS background limits break long-running training.

Solution: Incremental training with state persistence, resume on app foreground.
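The real fix lives in the Swift SDK, but the checkpoint/resume pattern is easy to show in Python: persist progress atomically after every batch and reload it when the app returns to the foreground. Names and the JSON state format here are illustrative:

```python
import json
import os
import tempfile

class TrainingCheckpoint:
    """Persist training progress so an interrupted session (e.g. the OS
    suspending the app) can resume where it left off."""

    def __init__(self, path):
        self.path = path

    def save(self, state):
        # Write atomically: a mid-write kill never corrupts the checkpoint
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path))
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def load(self):
        if not os.path.exists(self.path):
            return {"next_batch": 0, "accumulated_steps": 0}
        with open(self.path) as f:
            return json.load(f)

# Resume-or-start pattern, run on each foreground event
ckpt = TrainingCheckpoint(os.path.join(tempfile.mkdtemp(), "session.json"))
state = ckpt.load()        # fresh start or resume
state["next_batch"] += 1   # ... train one batch ...
ckpt.save(state)
```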

4. Reproducibility

Example: Research code often lacks seed control, making results non-deterministic.

Solution: Strict seeding, deterministic operations, CI tests verify reproducibility.
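A minimal version of strict seeding, with the PyTorch calls omitted (add `torch.manual_seed(seed)` and `torch.use_deterministic_algorithms(True)` when PyTorch is present):

```python
import os
import random

import numpy as np

def seed_everything(seed):
    """Pin every source of randomness we control so two runs of the
    same round produce identical results."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

seed_everything(42)
a = np.random.rand(3)
seed_everything(42)
b = np.random.rand(3)
assert (a == b).all()  # identical draws: the round is reproducible
```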

What We Learned

  1. Start simple: Implement basic version first, optimize later
  2. Test at scale: MNIST results don't transfer to 10K devices
  3. Monitor everything: Instrumentation is not optional
  4. Progressive rollout: Shadow mode → 1% → 10% → 100%
  5. User feedback: Researchers care about convergence, engineers care about latency

Octomil's Research Partnerships

We actively collaborate with research groups:

  • Richtárik Lab (KAUST): Communication efficiency, asynchronous optimization
  • Smith Lab (CMU): Fairness, privacy, personalization
  • Yang Lab (Texas A&M): Robust optimization, AUC maximization

How we work together:

  • Researchers get production feedback (does it work at scale?)
  • We get early access to techniques (implement before publication)
  • Joint papers on systems challenges (bridging theory and practice)

What's Next

Research areas we're tracking:

  • Test-time compute for FL: Scaling inference during federated training
  • LLM unlearning at scale: Efficient removal of device contributions
  • Federated RL: Extending FL to reinforcement learning
  • Cross-modal FL: Training across vision, language, audio simultaneously

Production features in development:

  • Automatic algorithm selection (ML meta-learning for best FL algorithm)
  • One-click migration from centralized to federated training
  • Multi-cloud deployment (AWS, Azure, GCP simultaneously)

Join Us

For researchers: Share your papers; we'll help productionize them.

For practitioners: Try Octomil and tell us what works (and what doesn't).


References


State-of-the-Art Methods (Implemented in Octomil)

Footnotes

  1. Tyurin, A. & Richtárik, P. (2025). Ringmaster ASGD: The first asynchronous SGD with optimal time complexity. ICML 2025. arXiv:2501.16168

  2. Richtárik, P., Gasanov, E., & Burlachenko, K. (2024). Error feedback reloaded: From quadratic to arithmetic mean of smoothness constants. ICLR 2024. arXiv:2402.10774

  3. Condat, L., Maranjyan, A., & Richtárik, P. (2026). BiCoLoR: Communication-efficient optimization with bidirectional compression and local training. arXiv:2601.12400

  4. Condat, L., Maranjyan, A., & Richtárik, P. (2025). LoCoDL: Communication-efficient distributed learning with local training and compression. ICLR 2025 (Spotlight). arXiv:2403.04348

  5. Yi, K., Condat, L., & Richtárik, P. (2025). Explicit personalization and local training: Double communication acceleration in federated learning (Scafflix). TMLR 2025. arXiv:2305.13170

  6. Yi, K., Meinhardt, G., Condat, L., & Richtárik, P. (2025). FedComLoc: Communication-efficient distributed training of sparse and quantized models. TMLR 2025. arXiv:2403.09904

  7. Shulgin, E., Malinovsky, G., Khirirat, S., & Richtárik, P. (2025). First provable guarantees for practical private FL: Beyond restrictive assumptions. arXiv:2512.21521

  8. Khirirat, S., Gorbunov, E., Horváth, S., Islamov, R., Karray, F., & Richtárik, P. (2024). Clip21: Error feedback for gradient clipping. arXiv:2305.18929

  9. Hu, S., Wu, Z. S., & Smith, V. (2024). Fair federated learning via bounded group loss. SaTML 2024 (Best Paper Award). Related: arXiv:2012.04221

  10. Hu, S., Wu, Z. S., & Smith, V. (2023). Private multi-task learning: Formulation and applications to federated learning. TMLR 2023. OpenReview

  11. Tyurin, A., Pozzi, M., Ilin, I., & Richtárik, P. (2024). Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity. NeurIPS 2024. arXiv:2402.04785

  12. Maranjyan, A., Shaikh Omar, O., & Richtárik, P. (2025). MindFlayer SGD: Efficient parallel SGD in the presence of heterogeneous and random worker compute times. UAI 2025. arXiv:2410.04285

  13. Yi, K., Khirirat, S., & Richtárik, P. (2024). Cohort squeeze: Beyond a single communication round per cohort in cross-device federated learning. NeurIPS 2024 FL Workshop (Oral). arXiv:2406.01115

  14. Malinovskii, V. et al. (2024). PV-Tuning: Beyond straight-through estimation for extreme LLM compression. NeurIPS 2024 (Oral, top 0.4%). arXiv:2405.14852

  15. Yi, K., Gazagnadou, N., Richtárik, P., & Lyu, L. (2024). FedP3: Personalized and privacy-friendly federated network pruning under model heterogeneity. ICLR 2024. arXiv:2404.09816

  16. Malinovsky, G. et al. (2024). Randomized asymmetric chain of LoRA: The first meaningful theoretical framework for low-rank adaptation (RAC-LoRA). arXiv:2410.08305

  17. Kuo, K., Raje, A., Rajesh, K., & Smith, V. (2024). Federated LoRA with sparse communication. arXiv

  18. Modoranu, I-V. et al. (2024). MicroAdam: Accurate adaptive optimization with low space overhead and provable convergence. NeurIPS 2024. arXiv:2405.15593