
12 posts tagged with "federated-learning"


The Seven Vectors of Convergence: Why On-Device AI Is Inevitable

· 26 min read

February 2026

Technology paradigm shifts do not arrive as single breakthroughs. They arrive as convergences -- multiple independent trends, each advancing on its own trajectory, reaching critical density at the same moment. The PC revolution required cheap transistors, graphical interfaces, and spreadsheet software simultaneously. The mobile revolution required capacitive touchscreens, 3G networks, and app distribution simultaneously. Cloud computing required virtualization, broadband ubiquity, and pay-per-use billing simultaneously.

We are now witnessing a convergence of equal magnitude. Seven independent vectors -- in hardware, software optimization, regulation, economics, device proliferation, application architecture, and developer infrastructure -- are aligning toward a single, unavoidable conclusion: the future of AI inference is on-device, and the future of AI improvement is federated.

This paper traces each vector with specificity, projects where each leads, and demonstrates why their intersection creates one of the largest platform opportunities in the history of computing.

Beyond Cross-Entropy: Federated AUC Maximization and X-Risk Optimization

· 9 min read

The default objective in machine learning is minimizing cross-entropy loss. It's what `torch.nn.CrossEntropyLoss` gives you out of the box. It's what most federated learning papers optimize. It's simple, well-understood, and works great for balanced classification problems.

But real-world applications rarely have balanced classes or standard objectives.

  • Medical diagnosis: 1% of patients have the disease → a model that always predicts "no disease" scores 99% accuracy
  • Fraud detection: 0.1% of transactions are fraudulent → cross-entropy happily ignores the minority class
  • Financial risk: We care about tail risk (the 1% worst-case scenarios), not average loss
  • Imbalanced federated data: Some devices have only positive examples, others only negatives

This post explores specialized optimization objectives for federated learning and how Octomil supports them through recent breakthroughs from Guo, Yang, and collaborators.
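To make "specialized objectives" concrete, here is a minimal, dependency-free sketch of one common building block: a pairwise squared-hinge surrogate for AUC. (The function name is illustrative, not Octomil's API.) AUC counts correctly ordered positive/negative pairs, so the surrogate penalizes any positive score that fails to beat a negative score by a margin:

```python
def pairwise_auc_surrogate(pos_scores, neg_scores, margin=1.0):
    """Squared-hinge surrogate for AUC: penalize every positive score
    that does not exceed a negative score by at least `margin`."""
    losses = [
        max(0.0, margin - (p - n)) ** 2
        for p in pos_scores
        for n in neg_scores
    ]
    return sum(losses) / len(losses)

# A well-separated batch incurs zero loss; an inverted one does not.
print(pairwise_auc_surrogate([2.0, 3.0], [0.0, 0.5]))  # 0.0
print(pairwise_auc_surrogate([0.0], [2.0]))            # 9.0
```

The pairwise coupling is exactly what makes this hard to federate directly (positives and negatives may live on different devices), which is why the research this post covers reformulates the objective rather than computing cross-device pairs.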

Byzantine-Robust FL: Defending Against Malicious Devices

· 8 min read

Federated learning has an adversary problem.

When training across thousands or millions of devices, you can't trust everyone. Some devices may be:

  • Compromised by malware
  • Malicious (intentionally poisoning the model)
  • Faulty (hardware errors, bugs)
  • Adversarially motivated (competitors, attackers)

A single malicious device uploading carefully crafted gradients can completely destroy model accuracy. Without defenses, federated learning is vulnerable to Byzantine attacks—named after the Byzantine Generals' Problem where some participants may be traitors.

This post explores Byzantine-robust aggregation methods and how Octomil implements defenses against adversarial devices.
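As a taste of what robust aggregation looks like, here is a coordinate-wise median aggregator, one of the classic Byzantine-robust alternatives to plain averaging (an illustrative sketch, not Octomil's actual implementation). A mean can be dragged arbitrarily far by a single attacker; a median cannot, as long as honest clients form a majority:

```python
import statistics

def coordinate_median(updates):
    """Aggregate client updates by taking the per-coordinate median.
    A minority of malicious clients cannot move it arbitrarily far."""
    return [statistics.median(coord) for coord in zip(*updates)]

honest = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9]]
poisoned = honest + [[1000.0, -1000.0]]  # one Byzantine attacker

print(coordinate_median(poisoned))  # stays near [1.0, 1.0]
```

Compare with the mean of the poisoned set, which would land near [250, -250]: one client out of four fully controls the result.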

Federated LLMs: Prompting, Cascading, and Fine-Tuning at Scale

· 11 min read

Large Language Models have changed everything—including federated learning.

The old FL paradigm: Train a small model (~100M parameters) from scratch across devices.

The new FL paradigm: Adapt a massive pre-trained model (7B-70B parameters) using federated techniques.

But LLMs bring unique challenges to federated learning:

  • Size: 7B parameters ≈ 28 GB at 32-bit precision (won't fit on most devices)
  • Compute: Full fine-tuning requires massive GPU memory
  • Inference cost: Running LLM inference on-device drains battery
  • Privacy: LLM memorization can leak training data

This post explores cutting-edge techniques for federated LLMs, from Virginia Smith's research group and beyond, showing how to make federated learning work in the foundation model era.
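One reason adaptation is feasible at all is parameter-efficient fine-tuning. The arithmetic below sketches why LoRA-style low-rank adapters shrink the trainable (and communicated) parameter count so dramatically (the 4096-wide projection and rank 8 are illustrative numbers, not a specific model's configuration):

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA replaces a full d_in x d_out weight update with two
    low-rank factors: A (d_in x rank) and B (rank x d_out)."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                                  # full fine-tuning of one projection
lora = lora_trainable_params(4096, 4096, rank=8)    # low-rank adapter instead

print(full, lora, full // lora)  # 16777216 65536 256
```

A 256x reduction per layer means devices upload kilobytes of adapter deltas instead of gigabytes of weights, which is what makes federated fine-tuning of foundation models plausible.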

Second-Order Federated Learning: When Newton Beats SGD

· 10 min read

Federated learning loves first-order methods. FedAvg, SCAFFOLD, FedProx—they all use gradients (first derivatives). They're simple, memory-efficient, and work reasonably well.

But here's a provocative question: What if we could converge in 10 rounds instead of 100?

Second-order methods—using curvature information (Hessians, second derivatives)—can achieve dramatically faster convergence by taking smarter steps. The classic tradeoff: Fewer iterations, but more computation per iteration.

In federated learning, this tradeoff flips in our favor: Communication is expensive, computation is cheap (especially with modern accelerators). Trading computation for communication is exactly what we want.

This post explores second-order methods for FL and shows when they're worth the extra compute.
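A one-variable example shows the appeal. On a quadratic, a Newton step using the exact second derivative lands on the minimum immediately, while fixed-step gradient descent closes the gap only geometrically (a toy sketch of the convergence gap, not an FL algorithm):

```python
def newton_step(grad, hess, x):
    """One Newton step: move by -grad/hess instead of -lr*grad."""
    return x - grad(x) / hess(x)

# f(x) = (x - 3)^2, minimized at x = 3.
grad = lambda x: 2 * (x - 3)
hess = lambda x: 2.0

x_newton = newton_step(grad, hess, 10.0)  # one "round"

x_gd = 10.0
for _ in range(10):                       # ten "rounds" of SGD-style steps
    x_gd -= 0.1 * grad(x_gd)

print(x_newton, x_gd)  # Newton is exact; GD is still ~0.75 away
```

In federated terms, each iteration is a communication round, so collapsing 100 rounds into 10 is worth a lot of local computation.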

Variance Reduction: The Secret to Fast FL Convergence

· 10 min read

Why does federated learning take so many communication rounds to converge?

A typical FL training job might require:

  • Standard SGD: 1,000+ rounds to converge
  • With variance reduction: 100-200 rounds to converge
  • Result: 5-10× speedup in wall-clock time

Variance reduction is the algorithmic technique that makes this possible. It's the difference between federated learning being a research curiosity and a production-viable technology.

This post dives into variance reduction methods—MARINA, PAGE, SAGA, and their variants—and explains why they're fundamental to efficient federated learning.
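To illustrate the core idea, here is a minimal SAGA loop on a toy one-dimensional problem (an illustrative sketch; MARINA and PAGE apply the same control-variate trick to communicated gradients in the federated setting). Each step corrects the fresh stochastic gradient using a table of stored past gradients, so the variance of the update shrinks as training proceeds:

```python
import random

def saga(a, steps=2000, lr=0.05, seed=0):
    """Minimal SAGA on f(x) = mean_i (x - a_i)^2; the optimum is mean(a)."""
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    table = [2 * (x - ai) for ai in a]   # stored per-example gradients
    avg = sum(table) / n
    for _ in range(steps):
        i = rng.randrange(n)
        g = 2 * (x - a[i])               # fresh stochastic gradient
        x -= lr * (g - table[i] + avg)   # variance-reduced step
        avg += (g - table[i]) / n        # keep the running average current
        table[i] = g
    return x

data = [1.0, 2.0, 3.0, 6.0]
print(saga(data))  # converges to mean(data) = 3.0
```

Plain SGD with the same step size would keep bouncing around the optimum forever; the control variate `-table[i] + avg` cancels that residual noise, which is the mechanism behind the round-count reductions quoted above.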

Communication-Efficient Federated Learning: From Theory to Production

· 6 min read

Communication is the most expensive operation in federated learning. While devices have increasingly powerful processors, network bandwidth remains constrained—especially on mobile devices with unreliable connections. This fundamental bottleneck has driven a decade of research into communication-efficient FL techniques.

In this post, we explore state-of-the-art communication reduction methods and show how Octomil implements these techniques for production use.
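One widely used technique in this family is top-k gradient sparsification: transmit only the largest-magnitude coordinates together with their indices, instead of the full dense vector. A minimal sketch (illustrative, not Octomil's actual wire format):

```python
def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector,
    returned as {index: value} pairs ready for transmission."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return {i: grad[i] for i in sorted(idx)}

g = [0.01, -2.5, 0.3, 0.0, 1.7, -0.02]
print(top_k_sparsify(g, k=2))  # {1: -2.5, 4: 1.7}
```

Production systems typically pair this with error feedback, accumulating the dropped coordinates locally so nothing is permanently lost, which is what keeps aggressive sparsification from hurting convergence.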

Handling Device Heterogeneity: Asynchronous FL for the Real World

· 9 min read

The textbook version of federated learning assumes a perfect world:

  • All devices have similar compute power
  • Network connections are equally fast
  • Devices complete training at roughly the same time
  • No one drops out mid-round

Reality: None of these assumptions hold.

In production FL, you're coordinating across:

  • iPhone 15 Pro (6-core CPU, 16-core GPU) vs. budget Android (4-core, no GPU)
  • Fiber or 5G (1 Gbps) vs. rural 3G (0.5 Mbps)
  • Always-plugged smart display vs. battery-conscious smartphone
  • Reliable edge server vs. intermittent mobile device

This post explores how Octomil handles the chaos of real-world device heterogeneity through asynchronous federated learning.
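A common asynchronous-FL ingredient is staleness weighting: a straggler's update, computed against an old copy of the global model, is still applied, but with reduced weight. A minimal sketch (the decay schedule and `alpha` value are illustrative assumptions, not Octomil's actual policy):

```python
def staleness_weight(staleness, alpha=0.6):
    """Down-weight updates computed against an out-of-date global model."""
    return alpha / (1.0 + staleness)

def async_update(global_model, client_update, staleness):
    """Mix a client's update into the global model, scaled by freshness."""
    w = staleness_weight(staleness)
    return [g + w * u for g, u in zip(global_model, client_update)]

fresh = async_update([1.0, 1.0], [0.5, -0.5], staleness=0)  # full effect
stale = async_update([1.0, 1.0], [0.5, -0.5], staleness=9)  # heavily damped
print(fresh, stale)
```

This lets the server accept updates the moment they arrive (no waiting for the slowest device in a round) while bounding the damage an update from an ancient model version can do.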

Model Compression for Edge Devices: Making LLMs Run on Smartphones

· 10 min read

The irony of modern federated learning: We want to train sophisticated models on edge devices, but those same devices often can't even run the models.

A state-of-the-art language model has 7B+ parameters (~28 GB at 32-bit). An iPhone 15 Pro has 8 GB RAM. This math doesn't work.

Model compression techniques—quantization, pruning, low-rank adaptation—are not just optimizations; they're prerequisites for production FL on edge devices. This post explores cutting-edge compression methods and how Octomil makes them accessible.
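The sizing arithmetic above generalizes into a one-line estimate that makes the quantization payoff obvious (weights only; activations and KV caches add more on top):

```python
def model_size_gb(n_params, bits_per_param):
    """Approximate in-memory size of a model's weights, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

seven_b = 7e9
print(model_size_gb(seven_b, 32))  # 28.0 GB at fp32: impossible on 8 GB RAM
print(model_size_gb(seven_b, 16))  # 14.0 GB at fp16: still impossible
print(model_size_gb(seven_b, 4))   # 3.5 GB at int4: fits on a phone
```

Going from 32-bit to 4-bit is an 8x reduction, which is exactly the gap between "this math doesn't work" and a 7B model running under an 8 GB memory ceiling.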

Personalized Federated Learning: One Global Model, Many Local Needs

· 7 min read

The fundamental premise of federated learning is to train a single global model across diverse devices. But what happens when "one size fits all" doesn't fit anyone particularly well?

The personalization dilemma: A global keyboard prediction model trained on millions of devices might be mediocre for everyone—users who text in multiple languages, users with specialized vocabularies (medical, legal), or users with unique writing styles all suffer from a lowest-common-denominator model.

This post explores how personalized federated learning enables Octomil to deliver both collective intelligence and individual adaptation.
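Architecturally, many personalization schemes boil down to a parameter split: a shared backbone that participates in federated averaging and a local head that never leaves the device. A minimal sketch of that split (the `head.` naming convention is an illustrative assumption, not Octomil's actual layout):

```python
def split_params(params, personal_prefixes=("head.",)):
    """Partition named parameters into federated (shared) and
    personal (kept on-device) groups by name prefix."""
    shared, personal = {}, {}
    for name, value in params.items():
        bucket = personal if name.startswith(personal_prefixes) else shared
        bucket[name] = value
    return shared, personal

params = {"backbone.embed": [0.1], "backbone.attn": [0.2], "head.out": [0.3]}
shared, personal = split_params(params)
print(sorted(shared), sorted(personal))
```

Only the `shared` group is uploaded and averaged; the `personal` group is fine-tuned locally, so a multilingual texter and a radiologist can share a backbone while keeping heads adapted to their own vocabularies.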