Advanced FL Configuration

This page covers configuration for model compression, variance reduction, and second-order optimization methods in Octomil federated learning.

Model Compression

Deploy smaller, faster edge models using quantization, pruning, and knowledge distillation.

Post-training quantization

Convert model weights to lower precision after training:

```bash
octomil optimize model.pt --quantize int8 --output model_int8.pt
```

| Method | Size Reduction | Latency Improvement | Accuracy Impact |
| --- | --- | --- | --- |
| INT8 | 4x | 2-3x | Under 1% loss |
| INT4 | 8x | 3-5x | 1-3% loss |
| Mixed precision | 2-4x | 1.5-2x | Under 0.5% loss |
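The CLI handles quantization end to end. Conceptually, symmetric per-tensor INT8 quantization maps each float weight to an integer in [-127, 127] via a single scale factor. A minimal pure-Python sketch of the idea (illustrative only, not Octomil's implementation):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization into [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(w / scale))) for w in weights], scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored weight is within one quantization step (scale) of the original.
```

The "Under 1% loss" figure in the table comes from this rounding error being small relative to typical weight magnitudes; INT4 halves the number of representable levels again, hence the larger accuracy hit.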

Pruning

Remove low-magnitude weights to reduce model size:

```python
from octomil import ModelOptimizer

optimizer = ModelOptimizer(api_key="edg_...")
optimized = optimizer.prune(
    model_id="speech-model",
    sparsity=0.5,         # Remove 50% of weights
    method="magnitude",   # or "structured"
    fine_tune_rounds=5,   # Fine-tune after pruning
)
```
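Magnitude pruning itself is simple: rank weights by absolute value and zero out the smallest fraction. A toy sketch of what `method="magnitude"` does conceptually (the real `prune()` also supports structured sparsity and post-pruning fine-tuning):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Threshold is the k-th smallest absolute value; ties at the
    # threshold are also removed in this simple version.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.1, -0.9, 0.05, 0.8], sparsity=0.5)
# The two smallest-magnitude weights (0.1 and 0.05) are zeroed.
```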

Knowledge distillation

Train a smaller "student" model to mimic a larger "teacher":

```python
result = federation.train(
    model="student-model",
    algorithm="fedavg",
    distillation={
        "teacher_model_id": "large-model",
        "temperature": 3.0,
        "alpha": 0.7,  # Weight of distillation loss vs task loss
    },
    rounds=20,
)
```
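Under the hood, distillation mixes two losses: a soft cross-entropy between temperature-softened teacher and student distributions, and the ordinary hard-label task loss, weighted by `alpha`. A hedged sketch of the standard Hinton-style formulation (including the T² scaling); the parameter names mirror the config above, but this is not Octomil's internal code:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=3.0, alpha=0.7):
    """alpha * soft (teacher-matching) loss + (1 - alpha) * hard (task) loss."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # T^2 keeps soft-loss gradient magnitudes comparable across temperatures.
    soft = -sum(t * math.log(s)
                for t, s in zip(p_teacher, p_student)) * temperature ** 2
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss([2.0, 0.5, -1.0], [2.5, 0.0, -0.5], label=0)
```

Higher temperatures expose more of the teacher's "dark knowledge" (relative probabilities of wrong classes); `alpha=0.7` means the student mostly imitates the teacher and only partly fits the raw labels.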

Variance Reduction

Non-IID data across devices causes client drift: local updates diverge from the global optimum. Variance reduction techniques correct for this drift.

SCAFFOLD

SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) uses control variates to correct client drift. Each client maintains a control variate that tracks the difference between local and global gradients.

```python
federation.train(
    model="heterogeneous-classifier",
    algorithm="scaffold",
    rounds=50,
    min_updates=20,
)
```

When to use: Non-IID data distributions across devices. SCAFFOLD converges faster than FedAvg on heterogeneous data but requires roughly twice the communication per round, since control variates are exchanged alongside model updates.
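The core of SCAFFOLD fits in a few lines: each local step subtracts the client control variate c_i and adds the server control variate c, and c_i is then refreshed from the net local movement ("option II" in the SCAFFOLD paper). A scalar-weight sketch of one client round, not the platform's implementation:

```python
def scaffold_client_update(w_global, c_global, c_local, grad_fn, lr, steps):
    """Drift-corrected local training, returning the new weights and
    the client's updated control variate."""
    w = w_global
    for _ in range(steps):
        # Correct the local gradient toward the global update direction.
        w -= lr * (grad_fn(w) - c_local + c_global)
    # Option II: new control variate from the net local movement.
    c_new = c_local - c_global + (w_global - w) / (steps * lr)
    return w, c_new

# Example: one client locally minimizes (w - 3)^2 starting from w_global = 0.
w, c = scaffold_client_update(
    w_global=0.0, c_global=0.0, c_local=0.0,
    grad_fn=lambda w: 2 * (w - 3), lr=0.1, steps=10,
)
```

With both control variates at zero this reduces to plain local SGD; once clients' variates diverge, the correction term cancels each client's idiosyncratic drift. The 2x communication cost in the note above is c_new traveling with the model update.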

FedProx

FedProx adds a proximal term to the local objective that penalizes deviation from the global model. It is simpler than SCAFFOLD and requires no additional communication.

```python
federation.train(
    model="mixed-fleet-model",
    algorithm="fedprox",
    fedprox_mu=0.01,  # Proximal term strength
    rounds=50,
)
```

When to use: Moderate data heterogeneity. FedProx is less effective than SCAFFOLD on severely non-IID data, but adds zero communication overhead.
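FedProx only changes the local gradient: the proximal term (mu/2)·(w − w_global)² contributes mu·(w − w_global) to each step, pulling the client back toward the global model. A scalar sketch, where `mu` plays the role of `fedprox_mu` in the config above (illustrative, not the SDK's internals):

```python
def fedprox_local_steps(w_global, grad_fn, mu, lr, steps):
    """Local SGD on: task loss + (mu / 2) * (w - w_global)^2."""
    w = w_global
    for _ in range(steps):
        # Task gradient plus the proximal pull toward the global model.
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w

# Local optimum is w = 3; the global model sits at w = 0.
drifted = fedprox_local_steps(0.0, lambda w: 2 * (w - 3), mu=0.0, lr=0.1, steps=20)
anchored = fedprox_local_steps(0.0, lambda w: 2 * (w - 3), mu=10.0, lr=0.1, steps=20)
# With mu = 0 the client drifts nearly all the way to its local optimum;
# a large mu anchors it much closer to the global model.
```

In practice `fedprox_mu` is small (0.001-0.1): just enough to damp drift without preventing useful local progress.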

Second-Order Methods

Standard FL uses first-order methods (SGD, Adam). Second-order methods use curvature information for faster convergence, especially on heterogeneous data.

FedNova

FedNova (Federated Normalized Averaging) normalizes client updates by the number of local steps, correcting the objective inconsistency that arises when devices perform different amounts of local computation.

```python
federation.train(
    model="variable-compute-model",
    algorithm="fednova",
    rounds=30,
)
```

When to use: Devices with varying compute capabilities that perform different numbers of local epochs.
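FedNova's fix lives entirely in aggregation: each client's cumulative update is divided by the number of local steps it took, so clients that ran more epochs do not dominate the average. A sketch on scalar weights with uniform client weighting (assumes vanilla local SGD; the real algorithm generalizes to momentum and non-uniform weights):

```python
def fednova_aggregate(w_global, client_models, client_steps):
    """Average normalized update directions, then rescale by the
    effective number of local steps."""
    n = len(client_models)
    # Per-client normalized direction: total movement / local steps taken.
    directions = [
        (w_global - w_i) / tau_i
        for w_i, tau_i in zip(client_models, client_steps)
    ]
    tau_eff = sum(client_steps) / n  # effective number of local steps
    return w_global - tau_eff * sum(directions) / n

# One client ran 2 local steps, the other 10; naive model averaging
# would be biased toward whichever client moved furthest.
new_global = fednova_aggregate(1.0, client_models=[0.8, 0.5], client_steps=[2, 10])
```

When every client runs the same number of steps, this reduces exactly to FedAvg, which is why FedNova has no downside on homogeneous fleets.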

Natural gradient methods

For advanced users, Octomil supports natural gradient approximations that use Fisher information to precondition updates:

```python
federation.train(
    model="advanced-model",
    algorithm="fedavg",
    optimizer="natural_gradient",
    fisher_samples=100,
    rounds=50,
)
```
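Natural gradient methods rescale each parameter's gradient by the inverse Fisher information, taking larger steps along flat directions of the loss and smaller steps along sharp ones. With a diagonal Fisher approximation (estimated from gradient samples, which is what `fisher_samples` controls), the update is elementwise. A hedged sketch of the idea, not Octomil's implementation:

```python
def fisher_diagonal(grad_samples):
    """Diagonal Fisher estimate: per-parameter mean of squared gradients."""
    n = len(grad_samples)
    dim = len(grad_samples[0])
    return [sum(g[i] ** 2 for g in grad_samples) / n for i in range(dim)]

def natural_gradient_step(w, grad, fisher_diag, lr=0.1, damping=1e-3):
    """Precondition the gradient by the damped inverse diagonal Fisher."""
    return [wi - lr * gi / (fi + damping)
            for wi, gi, fi in zip(w, grad, fisher_diag)]

# Dimension 0 is sharp (large gradients), dimension 1 is flat (small ones);
# preconditioning gives the flat dimension the larger effective step.
fisher = fisher_diagonal([[2.0, 0.1], [2.0, 0.1]])
new_w = natural_gradient_step([1.0, 1.0], [2.0, 0.1], fisher)
```

The `damping` term keeps the step bounded where the Fisher estimate is near zero; the extra compute cost is the gradient sampling needed to build the estimate.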

Comparison

| Method | Communication Cost | Non-IID Performance | Compute Overhead |
| --- | --- | --- | --- |
| FedAvg | 1x (baseline) | Poor on severe non-IID | Low |
| FedProx | 1x | Moderate | Low |
| SCAFFOLD | 2x | Strong | Moderate |
| FedNova | 1x | Good (variable compute) | Low |
| FedAdam | 1x | Good | Low |