Advanced FL Configuration

This page covers configuration for model compression, variance reduction, and second-order optimization methods in Octomil federated learning.

Model Compression

Deploy smaller, faster edge models using quantization, pruning, and knowledge distillation.

Post-training quantization

Convert model weights to lower precision after training:

```bash
octomil optimize model.pt --quantize int8 --output model_int8.pt
```

| Method | Size Reduction | Latency Improvement | Accuracy Impact |
| --- | --- | --- | --- |
| INT8 | 4x | 2-3x | Under 1% loss |
| INT4 | 8x | 3-5x | 1-3% loss |
| Mixed precision | 2-4x | 1.5-2x | Under 0.5% loss |
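The CLI handles quantization end to end. Conceptually, symmetric per-tensor INT8 quantization maps each float weight to an integer in [-127, 127] via a single scale factor. A minimal pure-Python sketch of the idea (illustrative only, not Octomil's implementation):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization into [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(w / scale))) for w in weights], scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored weight is within one quantization step (scale) of the original.
```

The "Under 1% loss" figure in the table comes from this rounding error being small relative to typical weight magnitudes; INT4 halves the number of representable levels again, hence the larger accuracy hit.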

Pruning

Remove low-magnitude weights to reduce model size:

```python
from octomil import ModelOptimizer

optimizer = ModelOptimizer(api_key="edg_...")
optimized = optimizer.prune(
    model_id="speech-model",
    sparsity=0.5,         # Remove 50% of weights
    method="magnitude",   # or "structured"
    fine_tune_rounds=5,   # Fine-tune after pruning
)
```
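Magnitude pruning itself is simple: rank weights by absolute value and zero out the smallest fraction. A toy sketch of what `method="magnitude"` does conceptually (the real `prune()` also supports structured sparsity and post-pruning fine-tuning):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Threshold is the k-th smallest absolute value; ties at the
    # threshold are also removed in this simple version.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.1, -0.9, 0.05, 0.8], sparsity=0.5)
# The two smallest-magnitude weights (0.1 and 0.05) are zeroed.
```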

Knowledge distillation

Train a smaller "student" model to mimic a larger "teacher":

```python
result = federation.train(
    model="student-model",
    algorithm="fedavg",
    distillation={
        "teacher_model_id": "large-model",
        "temperature": 3.0,
        "alpha": 0.7,  # Weight of distillation loss vs task loss
    },
    rounds=20,
)
```
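Under the hood, distillation mixes two losses: a soft cross-entropy between temperature-softened teacher and student distributions, and the ordinary hard-label task loss, weighted by `alpha`. A hedged sketch of the standard Hinton-style formulation (including the T² scaling); the parameter names mirror the config above, but this is not Octomil's internal code:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=3.0, alpha=0.7):
    """alpha * soft (teacher-matching) loss + (1 - alpha) * hard (task) loss."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # T^2 keeps soft-loss gradient magnitudes comparable across temperatures.
    soft = -sum(t * math.log(s)
                for t, s in zip(p_teacher, p_student)) * temperature ** 2
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss([2.0, 0.5, -1.0], [2.5, 0.0, -0.5], label=0)
```

Higher temperatures expose more of the teacher's "dark knowledge" (relative probabilities of wrong classes); `alpha=0.7` means the student mostly imitates the teacher and only partly fits the raw labels.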

Variance Reduction

Non-IID data across devices causes client drift: local updates diverge from the global optimum. Variance reduction techniques correct for this drift.

SCAFFOLD

SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) uses control variates to correct client drift. Each client maintains a control variate that tracks the difference between local and global gradients.

```python
federation.train(
    model="heterogeneous-classifier",
    algorithm="scaffold",
    rounds=50,
    min_updates=20,
)
```

When to use: Non-IID data distributions across devices. SCAFFOLD converges faster than FedAvg on heterogeneous data but requires roughly twice the communication per round, since control variates are exchanged alongside model updates.
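The core of SCAFFOLD fits in a few lines: each local step subtracts the client control variate c_i and adds the server control variate c, and c_i is then refreshed from the net local movement ("option II" in the SCAFFOLD paper). A scalar-weight sketch of one client round, not the platform's implementation:

```python
def scaffold_client_update(w_global, c_global, c_local, grad_fn, lr, steps):
    """Drift-corrected local training, returning the new weights and
    the client's updated control variate."""
    w = w_global
    for _ in range(steps):
        # Correct the local gradient toward the global update direction.
        w -= lr * (grad_fn(w) - c_local + c_global)
    # Option II: new control variate from the net local movement.
    c_new = c_local - c_global + (w_global - w) / (steps * lr)
    return w, c_new

# Example: one client locally minimizes (w - 3)^2 starting from w_global = 0.
w, c = scaffold_client_update(
    w_global=0.0, c_global=0.0, c_local=0.0,
    grad_fn=lambda w: 2 * (w - 3), lr=0.1, steps=10,
)
```

With both control variates at zero this reduces to plain local SGD; once clients' variates diverge, the correction term cancels each client's idiosyncratic drift. The 2x communication cost in the note above is c_new traveling with the model update.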

FedProx

FedProx adds a proximal term to the local objective that penalizes deviation from the global model. It is simpler than SCAFFOLD and requires no additional communication.

```python
federation.train(
    model="mixed-fleet-model",
    algorithm="fedprox",
    fedprox_mu=0.01,  # Proximal term strength
    rounds=50,
)
```

When to use: Moderate data heterogeneity. FedProx is less effective than SCAFFOLD on severely non-IID data, but adds zero communication overhead.
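FedProx only changes the local gradient: the proximal term (mu/2)·(w − w_global)² contributes mu·(w − w_global) to each step, pulling the client back toward the global model. A scalar sketch, where `mu` plays the role of `fedprox_mu` in the config above (illustrative, not the SDK's internals):

```python
def fedprox_local_steps(w_global, grad_fn, mu, lr, steps):
    """Local SGD on: task loss + (mu / 2) * (w - w_global)^2."""
    w = w_global
    for _ in range(steps):
        # Task gradient plus the proximal pull toward the global model.
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w

# Local optimum is w = 3; the global model sits at w = 0.
drifted = fedprox_local_steps(0.0, lambda w: 2 * (w - 3), mu=0.0, lr=0.1, steps=20)
anchored = fedprox_local_steps(0.0, lambda w: 2 * (w - 3), mu=10.0, lr=0.1, steps=20)
# With mu = 0 the client drifts nearly all the way to its local optimum;
# a large mu anchors it much closer to the global model.
```

In practice `fedprox_mu` is small (0.001-0.1): just enough to damp drift without preventing useful local progress.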

Second-Order Methods

Standard FL uses first-order methods (SGD, Adam). Second-order methods use curvature information for faster convergence, especially on heterogeneous data.

FedNova

FedNova (Federated Normalized Averaging) normalizes client updates by the number of local steps, correcting the objective inconsistency that arises when devices perform different amounts of local computation.

```python
federation.train(
    model="variable-compute-model",
    algorithm="fednova",
    rounds=30,
)
```

When to use: Devices with varying compute capabilities that perform different numbers of local epochs.
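FedNova's fix lives entirely in aggregation: each client's cumulative update is divided by the number of local steps it took, so clients that ran more epochs do not dominate the average. A sketch on scalar weights with uniform client weighting (assumes vanilla local SGD; the real algorithm generalizes to momentum and non-uniform weights):

```python
def fednova_aggregate(w_global, client_models, client_steps):
    """Average normalized update directions, then rescale by the
    effective number of local steps."""
    n = len(client_models)
    # Per-client normalized direction: total movement / local steps taken.
    directions = [
        (w_global - w_i) / tau_i
        for w_i, tau_i in zip(client_models, client_steps)
    ]
    tau_eff = sum(client_steps) / n  # effective number of local steps
    return w_global - tau_eff * sum(directions) / n

# One client ran 2 local steps, the other 10; naive model averaging
# would be biased toward whichever client moved furthest.
new_global = fednova_aggregate(1.0, client_models=[0.8, 0.5], client_steps=[2, 10])
```

When every client runs the same number of steps, this reduces exactly to FedAvg, which is why FedNova has no downside on homogeneous fleets.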

Natural gradient methods

For advanced users, Octomil supports natural gradient approximations that use Fisher information to precondition updates:

```python
federation.train(
    model="advanced-model",
    algorithm="fedavg",
    optimizer="natural_gradient",
    fisher_samples=100,
    rounds=50,
)
```
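Natural gradient methods rescale each parameter's gradient by the inverse Fisher information, taking larger steps along flat directions of the loss and smaller steps along sharp ones. With a diagonal Fisher approximation (estimated from gradient samples, which is what `fisher_samples` controls), the update is elementwise. A hedged sketch of the idea, not Octomil's implementation:

```python
def fisher_diagonal(grad_samples):
    """Diagonal Fisher estimate: per-parameter mean of squared gradients."""
    n = len(grad_samples)
    dim = len(grad_samples[0])
    return [sum(g[i] ** 2 for g in grad_samples) / n for i in range(dim)]

def natural_gradient_step(w, grad, fisher_diag, lr=0.1, damping=1e-3):
    """Precondition the gradient by the damped inverse diagonal Fisher."""
    return [wi - lr * gi / (fi + damping)
            for wi, gi, fi in zip(w, grad, fisher_diag)]

# Dimension 0 is sharp (large gradients), dimension 1 is flat (small ones);
# preconditioning gives the flat dimension the larger effective step.
fisher = fisher_diagonal([[2.0, 0.1], [2.0, 0.1]])
new_w = natural_gradient_step([1.0, 1.0], [2.0, 0.1], fisher)
```

The `damping` term keeps the step bounded where the Fisher estimate is near zero; the extra compute cost is the gradient sampling needed to build the estimate.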

Comparison

| Method | Communication Cost | Non-IID Performance | Compute Overhead |
| --- | --- | --- | --- |
| FedAvg | 1x (baseline) | Poor on severe non-IID | Low |
| FedProx | 1x | Moderate | Low |
| SCAFFOLD | 2x | Strong | Moderate |
| FedNova | 1x | Good (variable compute) | Low |
| FedAdam | 1x | Good | Low |