Advanced FL Configuration
This page covers configuration for model compression, variance reduction, and second-order optimization methods in Octomil federated learning.
Model Compression
Deploy smaller, faster edge models using quantization, pruning, and knowledge distillation.
Post-training quantization
Convert model weights to lower precision after training:
```bash
octomil optimize model.pt --quantize int8 --output model_int8.pt
```
| Method | Size Reduction | Latency Improvement | Accuracy Impact |
|---|---|---|---|
| INT8 | 4x | 2-3x | Under 1% loss |
| INT4 | 8x | 3-5x | 1-3% loss |
| Mixed precision | 2-4x | 1.5-2x | Under 0.5% loss |
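To make the size/accuracy trade-off concrete, here is a minimal pure-Python sketch of symmetric INT8 affine quantization (illustrative only, not the `octomil optimize` implementation): floats are mapped to the int8 range with a single scale factor, and dequantizing shows the round-trip error.

```python
# Illustrative sketch of symmetric INT8 quantization (not the CLI internals).

def quantize_int8(weights):
    """Map floats to [-128, 127] using a single symmetric scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from the int8 representation."""
    return [qi * scale for qi in q]

weights = [0.5, -1.2, 0.03, 0.97]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# each element's round-trip error is at most scale / 2
```

The per-element error bound of half the scale is why INT8 typically costs under 1% accuracy: the quantization noise is small relative to typical weight magnitudes.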
Pruning
Remove low-magnitude weights to reduce model size:
```python
from octomil import ModelOptimizer

optimizer = ModelOptimizer(api_key="edg_...")
optimized = optimizer.prune(
    model_id="speech-model",
    sparsity=0.5,         # Remove 50% of weights
    method="magnitude",   # or "structured"
    fine_tune_rounds=5,   # Fine-tune after pruning
)
```
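Magnitude pruning itself is simple to state: rank weights by absolute value and zero out the smallest fraction. A minimal sketch (illustrative, not the SDK internals; ties at the threshold may prune slightly more than requested):

```python
# Illustrative magnitude pruning: zero the `sparsity` fraction of weights
# with the smallest absolute value (ties at the threshold are all pruned).

def magnitude_prune(weights, sparsity):
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.9, -0.05, 0.4, -0.02, 0.7, 0.1], sparsity=0.5)
# → [0.9, 0.0, 0.4, 0.0, 0.7, 0.0]
```

The `fine_tune_rounds` parameter exists because zeroing weights perturbs the loss; a few rounds of training let the surviving weights compensate.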
Knowledge distillation
Train a smaller "student" model to mimic a larger "teacher":
```python
result = federation.train(
    model="student-model",
    algorithm="fedavg",
    distillation={
        "teacher_model_id": "large-model",
        "temperature": 3.0,
        "alpha": 0.7,  # Weight of distillation loss vs. task loss
    },
    rounds=20,
)
```
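The standard distillation objective (Hinton et al.) that a configuration like this suggests combines a temperature-softened KL term with ordinary cross-entropy, weighted by `alpha`. A self-contained sketch, assuming this is how `temperature` and `alpha` enter the loss (the SDK's exact formulation may differ):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=3.0, alpha=0.7):
    """alpha weights the soft (teacher) term against the hard-label term."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures
    soft = temperature ** 2 * sum(t * math.log(t / s) for t, s in zip(p_t, p_s))
    hard = -math.log(softmax(student_logits)[label])  # cross-entropy with true label
    return alpha * soft + (1 - alpha) * hard
```

A higher `temperature` flattens both distributions, so the student learns from the teacher's relative rankings of wrong classes rather than only its top prediction.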
Variance Reduction
Non-IID data across devices causes client drift: local updates diverge from the global optimum. Variance reduction techniques correct for this drift.
SCAFFOLD
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) uses control variates to correct client drift. Each client maintains a control variate that tracks the difference between local and global gradients.
```python
federation.train(
    model="heterogeneous-classifier",
    algorithm="scaffold",
    rounds=50,
    min_updates=20,
)
```
When to use: Non-IID data distributions across devices. SCAFFOLD converges faster than FedAvg on heterogeneous data but requires 2x communication (control variates are exchanged alongside model updates).
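The core of SCAFFOLD's drift correction can be sketched in a few lines (hypothetical helper names; the SDK handles this internally). Each local gradient step is corrected by the difference between the global and local control variates:

```python
# Sketch of SCAFFOLD's local update (not the SDK internals). Each step
# follows g - c_local + c_global instead of the raw gradient g, steering
# local updates back toward the global optimum.

def scaffold_local_step(w, grad, c_local, c_global, lr):
    return [wi - lr * (gi - ci + cgi)
            for wi, gi, ci, cgi in zip(w, grad, c_local, c_global)]

def updated_control_variate(c_local, c_global, w_start, w_end, steps, lr):
    # "Option II" from the SCAFFOLD paper: reuse the net local movement
    return [ci - cgi + (ws - we) / (steps * lr)
            for ci, cgi, ws, we in zip(c_local, c_global, w_start, w_end)]

# When the control variates agree, the step reduces to plain SGD:
w_next = scaffold_local_step([1.0, 2.0], [0.5, -0.5],
                             c_local=[0.1, 0.1], c_global=[0.1, 0.1], lr=0.1)
```

The 2x communication cost comes from `c_local` and `c_global` traveling alongside the model weights in both directions each round.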
FedProx
FedProx adds a proximal term to the local objective that penalizes deviation from the global model. It is simpler than SCAFFOLD and requires no additional communication.
```python
federation.train(
    model="mixed-fleet-model",
    algorithm="fedprox",
    fedprox_mu=0.01,  # Proximal term strength
    rounds=50,
)
```
When to use: Moderate data heterogeneity. Less effective than SCAFFOLD on severely non-IID data, but zero communication overhead.
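The proximal term is easy to state concretely: each client minimizes its local loss plus (mu/2) * ||w - w_global||^2, so the local gradient gains a pull-back term mu * (w - w_global). A minimal sketch (illustrative helper, not the SDK internals):

```python
# Illustrative FedProx gradient: the raw local gradient plus a term that
# pulls each coordinate back toward the global model, scaled by mu.

def fedprox_grad(grad, w, w_global, mu=0.01):
    return [g + mu * (wi - wgi) for g, wi, wgi in zip(grad, w, w_global)]

g = fedprox_grad([0.2, -0.1], w=[1.5, 0.0], w_global=[1.0, 0.0], mu=0.01)
# only the first coordinate is pulled back, since only it has drifted
```

Larger `fedprox_mu` values keep clients closer to the global model at the cost of slower local progress, which is why the default strength is small.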
Second-Order and Normalization Methods
Standard FL uses first-order optimizers (SGD, Adam). The methods below speed up convergence on heterogeneous data: FedNova normalizes client updates to correct objective inconsistency, while natural gradient methods use curvature information.
FedNova
FedNova (Federated Normalized Averaging) normalizes client updates by the number of local steps, correcting the objective inconsistency that arises when devices perform different amounts of local computation.
```python
federation.train(
    model="variable-compute-model",
    algorithm="fednova",
    rounds=30,
)
```
When to use: Devices with varying compute capabilities that perform different numbers of local epochs.
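The normalization FedNova performs can be sketched as follows (illustrative helper, not Octomil's aggregator): each client's cumulative update is divided by its number of local steps before averaging, then rescaled by an effective step count, so clients that ran more local epochs don't dominate the global update.

```python
# Illustrative FedNova-style aggregation (not the SDK internals).

def fednova_aggregate(w_global, client_models, taus, weights=None):
    """Aggregate client models, normalizing each update by its local steps."""
    n = len(client_models)
    weights = weights or [1.0 / n] * n
    # per-client normalized directions: d_i = (w_global - w_i) / tau_i
    directions = [[(wg - wc) / tau for wg, wc in zip(w_global, model)]
                  for model, tau in zip(client_models, taus)]
    tau_eff = sum(p * t for p, t in zip(weights, taus))  # effective step count
    return [wg - tau_eff * sum(p * directions[i][j] for i, p in enumerate(weights))
            for j, wg in enumerate(w_global)]

# A fast client (10 steps) and a slow one (2 steps) contribute the same
# normalized direction here, so neither dominates:
w_new = fednova_aggregate([1.0], client_models=[[0.0], [0.8]], taus=[10, 2])
```

Under plain FedAvg the 10-step client's larger raw update would pull the average disproportionately; normalizing by `tau_i` removes that bias.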
Natural gradient methods
For advanced users, Octomil supports natural gradient approximations that use Fisher information to precondition updates:
```python
federation.train(
    model="advanced-model",
    algorithm="fedavg",
    optimizer="natural_gradient",
    fisher_samples=100,
    rounds=50,
)
```
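One common natural-gradient approximation preconditions each gradient coordinate by a diagonal Fisher estimate built from squared per-sample gradients (this is presumably what `fisher_samples` controls). The sketch below is an assumption about this general technique, not the documented internals:

```python
# Sketch of diagonal-Fisher preconditioning (an assumption about what
# optimizer="natural_gradient" does; the actual internals may differ).

def fisher_diagonal(sample_grads):
    """Estimate the Fisher diagonal as the mean squared per-sample gradient."""
    n = len(sample_grads)
    return [sum(g[j] ** 2 for g in sample_grads) / n
            for j in range(len(sample_grads[0]))]

def natural_gradient(grad, fisher, damping=1e-3):
    # damping keeps the preconditioner well-behaved where curvature is near zero
    return [g / (f + damping) for g, f in zip(grad, fisher)]

f = fisher_diagonal([[0.2, 1.0], [0.4, 1.2]])
ng = natural_gradient([0.3, 1.1], f)
# the high-curvature coordinate is damped relative to the flat one
```

The effect is adaptive step sizing: directions where the loss surface is sharply curved get smaller steps, which is why curvature information helps on heterogeneous data.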
Comparison
| Method | Communication Cost | Non-IID Performance | Compute Overhead |
|---|---|---|---|
| FedAvg | 1x (baseline) | Poor on severe non-IID | Low |
| FedProx | 1x | Moderate | Low |
| SCAFFOLD | 2x | Strong | Moderate |
| FedNova | 1x | Good (variable compute) | Low |
| FedAdam | 1x | Good | Low |