From Research to Production: How Octomil Implements SOTA Federated Learning
The federated learning research landscape is exploding. NeurIPS 2024 alone featured 100+ FL papers. ICML, ICLR, TMLR—every major venue now has substantial FL content.
But there's a chasm between research prototypes and production systems.
Research papers provide algorithms, convergence proofs, and benchmark results on MNIST/CIFAR. Production systems need to handle millions of mobile devices, unreliable networks, Byzantine attackers, GDPR compliance, and 99.9% uptime requirements.
Octomil bridges this gap. This post explains how we translate cutting-edge research into a platform you can pip install and deploy.
The Research-to-Production Gap
What Research Papers Provide
A typical FL paper includes:
- Algorithm: Pseudocode for a new training method
- Theory: Convergence rate analysis (e.g., O(1/√T))
- Experiments: Performance on standard benchmarks
- Code: Often a research prototype (if you're lucky)
Example: Richtárik et al.'s "Ringmaster ASGD"1
- Algorithm: 50 lines of pseudocode
- Theory: First optimal time complexity for async SGD
- Experiments: CIFAR-10, synthetic data
- Code: Research implementation in PyTorch
What Production Systems Need
To deploy FL at scale:
- Device management: Registration, authentication, heartbeats
- Model versioning: Track every model iteration, enable rollback
- Failure handling: Device dropouts, network errors, crashes
- Monitoring: Real-time dashboards, alerts, metrics
- Security: Byzantine robustness, differential privacy, access control
- Infrastructure: Load balancing, auto-scaling, multi-region
- Mobile SDKs: iOS (Swift, CoreML), Android (Kotlin, TFLite)
- API design: Simple, intuitive, well-documented
- Compliance: GDPR, HIPAA, SOC 2
Result: Deploying one research paper's algorithm can require 100× more engineering effort than implementing the algorithm itself.
Octomil's Research-to-Production Pipeline
1. Paper Selection: 80/20 Rule
We track 50+ research groups and evaluate 200+ papers per year. Our filter:
Does this technique solve a problem Octomil users actually face?
We prioritize:
- Communication efficiency: Users pay for bandwidth
- Privacy: Regulations require strong guarantees
- Heterogeneity: Real devices vary wildly
- Compression: Edge devices have limited resources
- Fairness: Production systems serve diverse populations
We don't prioritize:
- Techniques that only work on IID data (unrealistic)
- Algorithms requiring specialized hardware (not accessible)
- Marginal improvements (1% gain not worth complexity)
Example: Richtárik et al.'s communication compression work (EF21, BiCoLoR, LoCoDL) directly addresses user pain points → Implemented.
2. Abstraction: Hide Complexity
Research algorithms have hyperparameters: learning rates, momentum, compression rates, clipping thresholds, staleness bounds.
Octomil's philosophy: Users shouldn't need a PhD to use SOTA techniques.
Our approach:
```python
# Research paper implementation (complex)
optimizer = EFSGD(
    params=model.parameters(),
    lr=0.1,
    compression_operator=TopKCompressor(k=0.01),
    error_feedback=True,
    memory_type="residual",
    nesterov=False,
    momentum=0.9,
    weight_decay=1e-4,
)

# Octomil API (simple)
client = octomil.OctomilClient(
    compression="adaptive"  # Auto-tunes everything
)
```
How we do it:
- Auto-tuning: Profile workload, select optimal hyperparameters
- Sensible defaults: Based on 100+ deployments
- Progressive disclosure: Simple by default, configurable if needed
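As a rough illustration of what an "adaptive" setting can mean under the hood, here is a minimal sketch of bandwidth-aware auto-tuning. The `choose_compression_ratio` helper and its thresholds are hypothetical, not Octomil's actual tuner: it profiles the payload and the link, then picks the largest top-k ratio whose upload fits the round's time budget.

```python
def choose_compression_ratio(model_bytes: int,
                             uplink_bytes_per_s: float,
                             round_budget_s: float,
                             floor: float = 0.01,
                             ceil: float = 1.0) -> float:
    """Pick the largest top-k ratio whose upload fits the round budget.

    Hypothetical heuristic: send as much of the gradient as the device's
    uplink allows, never dropping below `floor` or above `ceil`.
    """
    budget_bytes = uplink_bytes_per_s * round_budget_s
    ratio = budget_bytes / model_bytes  # fraction of entries we can afford
    return max(floor, min(ceil, ratio))

# A 100 MB model on a 1 MB/s uplink with a 10 s budget -> keep ~10% of entries.
print(choose_compression_ratio(100_000_000, 1_000_000, 10.0))
```

A real tuner would also fold in observed round times and convergence signals, but the shape is the same: measure, then pick the hyperparameter for the user.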
3. Systems Integration
Research code runs on a single machine. Production needs distributed infrastructure.
Octomil's architecture:
┌─────────────────────────────────────────────────────────────┐
│ Client SDKs │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Python SDK │ │ Swift SDK │ │ Kotlin SDK │ │
│ │ (Research) │ │ (iOS Prod) │ │ (Android) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Control Plane (FastAPI) │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Research Algorithm Layer │ │
│ │ - FedAvg, Scafflix, Ringmaster, EF21, etc. │ │
│ └────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Production Services Layer │ │
│ │ - Device Manager - Model Registry │ │
│ │ - Round Manager - Metrics Collector │ │
│ │ - A/B Testing - Privacy Accounting │ │
│  └────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Data & Infrastructure │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ S3/MinIO │ │ Redis │ │
│ │ (Metadata) │ │ (Models) │ │ (Cache) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
Key point: Research algorithms are a layer within a much larger system.
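One way to keep that research layer swappable inside the larger system is a plain registry that resolves the string users pass in to an aggregation strategy. This is an illustrative sketch, not Octomil's internals; the names are hypothetical:

```python
from typing import Callable, Dict, List

class Aggregator:
    """Minimal interface the production services layer programs against."""
    def aggregate(self, updates: List[List[float]]) -> List[float]:
        raise NotImplementedError

class FedAvgAggregator(Aggregator):
    def aggregate(self, updates):
        # Plain coordinate-wise mean of client updates.
        n = len(updates)
        return [sum(col) / n for col in zip(*updates)]

AGGREGATORS: Dict[str, Callable[[], Aggregator]] = {
    "fedavg": FedAvgAggregator,
    # "ef21", "scafflix", "ringmaster", ... register here the same way.
}

def make_aggregator(name: str) -> Aggregator:
    return AGGREGATORS[name]()

agg = make_aggregator("fedavg")
print(agg.aggregate([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

The payoff of this design is that device management, metrics, and privacy accounting never need to know which paper's algorithm is running underneath.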
4. Mobile Implementation
Most FL research uses Python/PyTorch. Production needs native mobile.
Octomil's mobile stack:
| Platform | Language | ML Runtime | Key Challenges |
|---|---|---|---|
| iOS | Swift | CoreML | NPU acceleration, background limits |
| Android | Kotlin | TFLite | Fragmentation, battery management |
| Web | TypeScript | ONNX Runtime | Limited compute, privacy in browser |
Example: Implementing EF21 (error feedback compression) on iOS:
- Research code (Python, 50 lines):
```python
import torch

def ef21_compress(grad, error, k=0.1):
    # Error feedback: fold the residual from the last round into this one
    compensated = grad + error
    # Keep only the top-k fraction of entries by magnitude
    mask = torch.abs(compensated) > torch.quantile(torch.abs(compensated), 1 - k)
    compressed = compensated * mask
    # Whatever we dropped becomes next round's error
    error = compensated - compressed
    return compressed, error
```
- Production code (Swift, 200 lines):
```swift
class EF21Compressor {
    private var errorBuffer: MLMultiArray
    private let compressionRatio: Float
    private let accelerator: MetalCompressor  // GPU acceleration

    func compress(gradient: MLMultiArray) -> CompressedUpdate {
        // 1. Add error compensation (vectorized via Accelerate framework)
        vDSP.add(gradient, errorBuffer, result: &compensated)
        // 2. Compute quantile (parallel on NPU if available)
        let threshold = accelerator.quantile(compensated, q: 1 - compressionRatio)
        // 3. Sparsify (Metal shader for GPU acceleration)
        let compressed = accelerator.threshold(compensated, threshold: threshold)
        // 4. Update error buffer
        vDSP.subtract(compensated, compressed, result: &errorBuffer)
        // 5. Serialize for transmission (Protocol Buffers)
        return CompressedUpdate(
            indices: compressed.indices,
            values: compressed.values,
            metadata: metadata
        )
    }
}
```
Additional complexity:
- Battery management: Pause training if battery < 20%
- Memory constraints: Stream gradients to avoid OOM
- Background limits: iOS allows ~30s in background
- Hardware acceleration: Use Apple Neural Engine when available
5. Testing at Scale
Research papers test on MNIST/CIFAR with 10-100 simulated clients.
Octomil testing:
- Unit tests: Algorithm correctness (does EF21 match paper?)
- Integration tests: End-to-end flows (device registration → training → aggregation)
- Scale tests: 10K+ simulated devices
- Production tests: Shadow mode (run new algorithm alongside stable, compare)
Example test failure that caught a bug:
```python
# Test: EF21 convergence with 10K heterogeneous devices
def test_ef21_convergence():
    devices = [
        SimulatedDevice(compute_speed=speed, dropout_rate=dropout)
        for speed, dropout in generate_heterogeneous_devices(10_000)
    ]
    model = train_federated(
        devices=devices,
        algorithm="ef21",
        rounds=100,
    )
    assert model.accuracy > 0.95, "EF21 failed to converge with heterogeneity"

# This test FAILED in v0.4.2
# Root cause: Error buffer not properly synchronized across rounds
# Fix: Add explicit error state management
```
6. Monitoring & Observability
Research papers report final accuracy. Production needs real-time monitoring.
Octomil's dashboard tracks:
- Round progress: Devices participating, updates received, time elapsed
- Device health: CPU/GPU utilization, memory usage, battery impact
- Model metrics: Loss, accuracy, gradient norms, weight distributions
- Communication: Bytes sent/received, compression ratios, latencies
- Privacy: ε spent, δ probability, clipping thresholds
- Failures: Dropped devices, aggregation errors, Byzantine detections
Example alert:
```
ALERT: Round 47 staleness exceeds threshold
- 23% of devices have staleness > 5 (threshold)
- Recommendation: Increase staleness_threshold or reduce device count
- Affected algorithm: Ringmaster ASGD
```
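An alert like that reduces to a simple check over per-device staleness counters. A minimal sketch, with hypothetical thresholds and message format (not Octomil's actual alerting rules):

```python
def staleness_alert(staleness_by_device: dict,
                    staleness_threshold: int = 5,
                    max_stale_fraction: float = 0.2):
    """Return an alert string if too many devices exceed the staleness bound."""
    stale = [d for d, s in staleness_by_device.items() if s > staleness_threshold]
    frac = len(stale) / len(staleness_by_device)
    if frac > max_stale_fraction:
        return (f"ALERT: {frac:.0%} of devices have staleness > "
                f"{staleness_threshold}")
    return None

# 3 of 10 devices are stale -> 30% exceeds the 20% budget, so we alert.
print(staleness_alert({f"d{i}": (9 if i < 3 else 1) for i in range(10)}))
```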
Research Papers Implemented in Octomil
Communication Efficiency
| Paper | Technique | Octomil Feature |
|---|---|---|
| EF21 (Richtárik)2 | Error feedback compression | compression="ef21" |
| BiCoLoR (Richtárik)3 | Bidirectional compression + local training | compression="bicolor" |
| LoCoDL (Richtárik)4 | Local training + compression | compression="locodl" |
| Scafflix (Smith)5 | Personalization + local training | personalization="scafflix" |
| FedComLoc (Richtárik)6 | Sparse + quantized training | compression="fedcomloc" |
Privacy & Security
| Paper | Technique | Octomil Feature |
|---|---|---|
| Fed-α-NormEC (Richtárik)7 | Practical DP-FL | privacy="differential" |
| Clip21 (Richtárik)8 | Error feedback + DP | privacy="clip21" |
| Bounded Group Loss (Smith)9 | Fairness guarantees | fairness="bounded-group-loss" |
| Private Multi-Task (Smith)10 | Task-specific privacy | privacy="multi-task" |
Asynchronous & Heterogeneous Systems
| Paper | Technique | Octomil Feature |
|---|---|---|
| Ringmaster ASGD (Richtárik)1 | Optimal async SGD | scheduler="ringmaster" |
| Shadowheart SGD (Richtárik)11 | Compute + network hetero | scheduler="shadowheart" |
| MindFlayer SGD (Richtárik)12 | Stochastic completion times | scheduler="mindflayer" |
| Cohort Squeeze (Richtárik)13 | Multi-round cohorts | cohort_squeeze=True |
Model Compression
| Paper | Technique | Octomil Feature |
|---|---|---|
| PV-Tuning (Richtárik)14 | Extreme quantization | quantization="pv-tuning" |
| FedP3 (Richtárik)15 | Personalized pruning | pruning="fedp3" |
| RAC-LoRA (Richtárik)16 | Theoretical LoRA framework | adaptation="rac-lora" |
| Federated LoRA (Smith)17 | Sparse LoRA communication | adaptation="fedlora-sparse" |
| MicroAdam (Richtárik)18 | Low-memory optimizer | optimizer="microadam" |
Case Study: Implementing Shadowheart SGD
Let's walk through implementing one research paper end-to-end.
Research Paper (Richtárik et al., NeurIPS 2024)
Title: Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity
Key contribution: First async algorithm optimal for both compute AND network heterogeneity.
Algorithm (simplified pseudocode):
```
for round t = 1, 2, ..., T:
    for each device i in active_devices:
        # Predict device latency
        latency_i = predict_latency(device_i, model_version_t)
        # Adaptive scheduling
        if latency_i < threshold:
            assign_task(device_i, model_version_t)
    # Asynchronous aggregation
    while time < round_deadline:
        if update_available():
            model_t = aggregate(model_t, received_updates)
    broadcast(model_t)
```
Octomil Implementation (Staged Rollout)
Phase 1: Core Algorithm (Week 1-2)
```python
# server/aggregators/shadowheart.py
class ShadowheartAggregator(AsyncAggregator):
    def __init__(self, staleness_threshold=5):
        self.staleness_threshold = staleness_threshold
        self.latency_predictor = LatencyPredictor()
        self.model_versions = ModelVersionManager()

    def select_devices(self, available_devices, round_deadline):
        # Predict latencies
        predictions = {
            device: self.latency_predictor.predict(device)
            for device in available_devices
        }
        # Adaptive selection (Shadowheart scheduling logic)
        selected = [
            device for device, latency in predictions.items()
            if latency < round_deadline * 0.8  # 80% buffer
        ]
        return selected

    def aggregate(self, updates):
        # Version-aware aggregation
        aggregated = torch.zeros_like(self.model)
        for update in updates:
            staleness = self.model_versions.current - update.version
            if staleness <= self.staleness_threshold:
                weight = self._compute_weight(staleness)
                aggregated += weight * update.gradients
        return aggregated

    def _compute_weight(self, staleness):
        # Shadowheart weighting scheme
        return 1.0 / (1.0 + staleness)
```
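As a quick sanity check on that `1/(1+s)` weighting: each extra round of staleness shrinks an update's influence hyperbolically, so a fresh update counts exactly twice as much as a one-round-old one. The normalization step below is added for illustration and is not part of the snippet above:

```python
def staleness_weights(stalenesses):
    """1/(1+s) weights, normalized here so they sum to 1."""
    raw = [1.0 / (1.0 + s) for s in stalenesses]
    total = sum(raw)
    return [w / total for w in raw]

# s=0 gets twice the weight of s=1, four times the weight of s=3.
print(staleness_weights([0, 1, 3]))
```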
Phase 2: Latency Prediction (Week 3)
```python
# server/predictors/latency_predictor.py
from sklearn.ensemble import RandomForestRegressor

class LatencyPredictor:
    def __init__(self):
        self.history = {}  # device_id -> [latency samples]
        self.model = RandomForestRegressor()  # ML-based prediction

    def predict(self, device):
        if device.id not in self.history:
            return self._default_latency(device)
        # Features: device hardware, network, time of day, etc.
        features = self._extract_features(device)
        return self.model.predict([features])[0]

    def update(self, device, actual_latency):
        # Online learning: fold each observed latency back into the predictor
        self.history.setdefault(device.id, []).append(actual_latency)
        if len(self.history[device.id]) > 100:
            self._retrain()
```
Phase 3: Mobile SDK (Week 4-6)
```swift
// sdks/ios/Octomil/Aggregators/ShadowheartClient.swift
class ShadowheartClient: AsyncClient {
    func train(model: MLModel, data: Data) async throws -> Update {
        let startTime = Date()
        // Local training
        let gradients = try await trainLocally(model: model, data: data)
        // Measure latency components
        let computeTime = Date().timeIntervalSince(startTime)
        let networkTime = try await measureNetworkLatency()
        // Report latency for predictor
        try await reportLatency(
            compute: computeTime,
            network: networkTime
        )
        // Upload update
        return try await uploadUpdate(gradients: gradients)
    }
}
```
Phase 4: Testing (Week 7)
```python
# tests/test_shadowheart.py
def test_shadowheart_heterogeneous_devices():
    # Simulate 1000 devices with 100× compute variance, 50× network variance
    devices = generate_heterogeneous_devices(
        count=1000,
        compute_variance=100,
        network_variance=50,
    )
    # Train with Shadowheart
    model, metrics = train_federated(
        devices=devices,
        algorithm="shadowheart",
        rounds=50,
    )
    # Verify optimal time complexity
    expected_rounds = compute_optimal_rounds(devices)
    assert metrics.rounds_to_convergence <= expected_rounds * 1.1  # 10% slack
```
Phase 5: Production Rollout (Week 8)
```python
# Shadow mode: Run alongside existing algorithm
client = octomil.OctomilClient(
    project_id="keyboard-prediction",
    training_mode="shadow",          # Don't use Shadowheart output yet
    shadow_algorithm="shadowheart",  # Compare against current
    rollout_percentage=1,            # All devices shadow test
)

# After 1 week of shadow testing, gradual rollout
client.set_training_mode("production")
client.set_algorithm("shadowheart")
client.gradual_rollout(start=0.01, end=1.0, duration_days=14)
```
Total: 8 weeks from research paper to production deployment.
Challenges We've Faced
1. Theory vs. Practice Gaps
Example: Many papers assume devices complete training deterministically. Reality: devices drop out randomly.
Solution: Extend algorithms with dropout handling (checkpointing, partial aggregation).
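A minimal sketch of dropout-tolerant partial aggregation (the quorum rule and names are illustrative, not Octomil's exact policy): aggregate whatever arrives before the deadline, weighted by sample count, and only commit the round if enough devices reported.

```python
def partial_aggregate(updates, min_quorum=0.5, expected=None):
    """FedAvg over the updates that actually arrived.

    `updates` is a list of (num_samples, delta_vector) pairs from devices
    that survived the round; `expected` is how many were scheduled.
    Returns None (round skipped) if fewer than `min_quorum` reported.
    """
    expected = expected or len(updates)
    if len(updates) < min_quorum * expected:
        return None  # too many dropouts: keep the previous model
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    # Sample-weighted mean, computed only over the survivors
    return [sum(n * d[i] for n, d in updates) / total for i in range(dim)]

# 2 of 4 scheduled devices dropped out; the other two still form a round.
print(partial_aggregate([(10, [1.0, 1.0]), (30, [3.0, 3.0])], expected=4))
```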
2. Hyperparameter Sensitivity
Example: EF21 requires tuning compression rate k per workload.
Solution: Auto-tuning via Bayesian optimization on initial rounds.
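The tuning loop can be sketched as a small search over candidate compression rates during the initial rounds. Bayesian optimization is swapped for brute-force probing here, and the `evaluate_round` callback is hypothetical:

```python
def tune_compression(evaluate_round, candidates=(0.01, 0.05, 0.1, 0.5)):
    """Run one probe round per candidate k and keep the best-scoring one.

    `evaluate_round(k)` runs a short federated round at compression rate k
    and returns a score (e.g. loss reduction per byte sent); higher is
    better. A real tuner would use Bayesian optimization over a continuous
    range instead of exhaustive probes.
    """
    scores = {k: evaluate_round(k) for k in candidates}
    return max(scores, key=scores.get)

# Toy objective that peaks at k = 0.1.
print(tune_compression(lambda k: -(k - 0.1) ** 2))
```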
3. Mobile Constraints
Example: iOS background limits break long-running training.
Solution: Incremental training with state persistence, resume on app foreground.
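Incremental training with state persistence boils down to a checkpoint/resume loop: save progress after every slice of work so an OS interruption only loses the current slice. A sketch, with an illustrative JSON file and field names:

```python
import json
import os

CKPT = "training_state.json"

def load_state():
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "loss": None}

def train_increment(state, steps=10):
    # Stand-in for a real local-training slice on the device.
    state["step"] += steps
    state["loss"] = 1.0 / state["step"]
    with open(CKPT, "w") as f:
        json.dump(state, f)  # persist before the OS can suspend us
    return state

state = train_increment(load_state())  # first foreground window
state = train_increment(load_state())  # resumed after an interruption
print(state["step"])  # 20
os.remove(CKPT)
```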
4. Reproducibility
Example: Research code often lacks seed control, making results non-deterministic.
Solution: Strict seeding, deterministic operations, CI tests verify reproducibility.
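The seeding discipline can be sketched as one helper called at the start of every run. Only the stdlib part is executable here; the `numpy`/`torch` lines a real deployment would add are shown as comments:

```python
import random

def seed_everything(seed: int) -> None:
    """Make runs with the same seed produce the same sampling decisions."""
    random.seed(seed)
    # A full stack would also pin:
    #   np.random.seed(seed); torch.manual_seed(seed)
    #   torch.use_deterministic_algorithms(True)

seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
assert first == [random.random() for _ in range(3)]  # bit-identical replay
print("reproducible")
```

A CI job can then assert that two seeded runs yield identical metrics, which is exactly the reproducibility check mentioned above.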
What We Learned
- Start simple: Implement basic version first, optimize later
- Test at scale: MNIST results don't transfer to 10K devices
- Monitor everything: Instrumentation is not optional
- Progressive rollout: Shadow mode → 1% → 10% → 100%
- User feedback: Researchers care about convergence, engineers care about latency
Octomil's Research Partnerships
We actively collaborate with research groups:
- Richtárik Lab (KAUST): Communication efficiency, asynchronous optimization
- Smith Lab (CMU): Fairness, privacy, personalization
- Yang Lab (Texas A&M): Robust optimization, AUC maximization
How we work together:
- Researchers get production feedback (does it work at scale?)
- We get early access to techniques (implement before publication)
- Joint papers on systems challenges (bridging theory and practice)
What's Next
Research areas we're tracking:
- Test-time compute for FL: Scaling inference during federated training
- LLM unlearning at scale: Efficient removal of device contributions
- Federated RL: Extending FL to reinforcement learning
- Cross-modal FL: Training across vision, language, audio simultaneously
Production features in development:
- Automatic algorithm selection (ML meta-learning for best FL algorithm)
- One-click migration from centralized to federated training
- Multi-cloud deployment (AWS, Azure, GCP simultaneously)
Join Us
For researchers: Share your papers, we'll help productionize them. For practitioners: Try Octomil, tell us what works (and what doesn't).
References
Foundational Papers
The paper that started it all:
Core optimization methods:
Personalization:
Survey: