
Second-Order Federated Learning: When Newton Beats SGD

· 10 min read

Federated learning loves first-order methods. FedAvg, SCAFFOLD, FedProx—they all use gradients (first derivatives). They're simple, memory-efficient, and work reasonably well.

But here's a provocative question: What if we could converge in 10 rounds instead of 100?

Second-order methods, which use curvature information (Hessians, second derivatives), can achieve dramatically faster convergence by taking smarter steps. The classic tradeoff: fewer iterations, but more computation per iteration.
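To make the "fewer iterations" claim concrete, here is a minimal sketch (not from any FL library; plain NumPy on a toy strongly convex quadratic, where the Hessian is exact and constant) comparing one Newton step against a handful of gradient-descent steps:

```python
import numpy as np

# Toy strongly convex quadratic: f(x) = 0.5 x^T A x - b^T x
# Gradient: A x - b.  Hessian: A (constant for a quadratic).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)      # symmetric positive definite
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)   # exact minimizer

def grad(x):
    return A @ x - b

# First-order: a few steps of gradient descent with a safe step size.
x_gd = np.zeros(5)
lr = 1.0 / np.linalg.eigvalsh(A).max()
for _ in range(5):
    x_gd = x_gd - lr * grad(x_gd)

# Second-order: one Newton step, x <- x - H^{-1} grad(x).
x_newton = np.zeros(5)
x_newton = x_newton - np.linalg.solve(A, grad(x_newton))

newton_err = np.linalg.norm(x_newton - x_star)
gd_err = np.linalg.norm(x_gd - x_star)
print(f"Newton error after 1 step: {newton_err:.2e}")
print(f"GD error after 5 steps:    {gd_err:.2e}")
```

On a quadratic, a single Newton step lands on the minimizer (up to floating-point error), while gradient descent is still contracting toward it. Real FL objectives are not quadratic, so Newton's one-step exactness becomes "fast local convergence," but the iteration-count gap is the same phenomenon.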

In federated learning, this tradeoff flips in our favor: Communication is expensive, computation is cheap (especially with modern accelerators). Trading computation for communication is exactly what we want.

This post explores second-order methods for FL and shows when they're worth the extra compute.