Byzantine-Robust FL: Defending Against Malicious Devices
Federated learning has an adversary problem.
When training across thousands or millions of devices, you can't trust everyone. Some devices may be:
- Compromised by malware
- Malicious (intentionally poisoning the model)
- Faulty (hardware errors, bugs)
- Adversarially motivated (competitors, attackers)
A single malicious device uploading carefully crafted gradients can completely destroy model accuracy. Without defenses, federated learning is vulnerable to Byzantine attacks—named after the Byzantine Generals' Problem where some participants may be traitors.
This post explores Byzantine-robust aggregation methods and how Octomil implements defenses against adversarial devices.
The Byzantine Threat Model
What Can An Attacker Do?
Byzantine adversary capabilities:
- Arbitrary gradients: Can send any values (not constrained to follow training algorithm)
- Collude: Multiple malicious devices can coordinate attacks
- Adaptive: Can observe aggregated model and adjust strategy
- Sybil attacks: Can create multiple fake identities
Attack examples:
Label flipping attack:
# Malicious device: flip every label before computing the update
for x, y in local_data:
    y_flipped = 9 - y  # for MNIST: 0→9, 1→8, etc.
    malicious_gradient = compute_gradient(x, y_flipped)
Model poisoning attack:
# Send gradient that moves model toward backdoor
malicious_gradient = sign(target_gradient) * large_magnitude
Convergence attack:
# Send random noise to prevent convergence
malicious_gradient = random_noise(scale=1000)
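To see why defenses are necessary, here is a minimal numpy sketch (not Octomil code) showing how a single Byzantine update dominates a plain average, while a coordinate-wise median ignores it:

```python
# Minimal sketch (numpy, not Octomil code): one Byzantine device vs. a plain mean.
import numpy as np

rng = np.random.default_rng(0)

# Nine honest devices report gradients near the true value (here, 1.0 per coordinate).
honest = [np.full(4, 1.0) + 0.01 * rng.standard_normal(4) for _ in range(9)]

# One Byzantine device sends an arbitrarily large gradient.
byzantine = np.full(4, -1000.0)

updates = honest + [byzantine]
naive_mean = np.mean(updates, axis=0)       # dominated by the attacker (~ -99 per coordinate)
robust_median = np.median(updates, axis=0)  # ignores the single outlier (~ 1.0 per coordinate)
```

One device out of ten is enough to move the naive average by two orders of magnitude, which is exactly the failure mode the rest of this post defends against.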
The Honest Majority Assumption
Byzantine robustness typically assumes:
- Total devices: n
- Byzantine devices: f < n/2 (less than half)
- Honest devices: n − f > n/2 (more than half)
If the honest majority assumption holds, Byzantine-robust algorithms can recover the true gradient and converge correctly.
If it does not hold, no algorithm can guarantee correctness: a coordinated Byzantine majority is statistically indistinguishable from the honest devices, so the true gradient cannot be identified.
Byzantine Robustness Without Coordination
The Challenge in Federated Settings
Centralized Byzantine robustness (classical literature) assumes:
- All devices participate every round
- Synchronous execution
- Server knows f (the number of Byzantine devices)
Federated learning is different:
- Partial participation: Only subset of devices participate per round
- Asynchronous: Devices have different speeds
- Unknown f: The number of malicious devices is not known in advance
Byzantine Robustness and Partial Participation (Richtárik et al., NeurIPS 2024)
This work shows that Byzantine robustness and partial participation can be achieved simultaneously, with one simple mechanism.[1]
Key insight: Instead of clipping absolute gradients, clip gradient differences:
clipped_diff = clip(g_i - g_ref, τ)
where g_ref is a reference gradient (e.g., from previous round).
Why this works:
- Honest devices have small gradient differences (smooth optimization)
- Byzantine devices have large gradient differences (arbitrary values)
- Clipping removes Byzantine influence while preserving honest signal
Algorithm (simplified):
import numpy as np

def byzantine_robust_aggregate(gradients, reference_grad, clip_norm):
    clipped_diffs = []
    for grad in gradients:
        # Compute difference from the reference gradient
        diff = grad - reference_grad
        # Clip to bound its magnitude
        if np.linalg.norm(diff) > clip_norm:
            diff = clip_norm * diff / np.linalg.norm(diff)
        clipped_diffs.append(diff)
    # Average the clipped differences
    avg_diff = np.mean(clipped_diffs, axis=0)
    # New aggregated gradient
    return reference_grad + avg_diff
Theoretical guarantee: Converges to the optimum even in the presence of Byzantine devices, provided their fraction stays below the assumed bound.
import octomil

# Byzantine-robust FL with partial participation
client = octomil.OctomilClient(
    project_id="robust-fl",
    byzantine_defense="clipping-vr",  # clipping with variance reduction
    # Clipping threshold
    clip_threshold=1.0,  # auto-tuned by default
    # Partial participation
    participation_rate=0.1,  # 10% per round
    # Assume up to 30% Byzantine devices
    max_byzantine_fraction=0.3,
)

client.train(
    model=my_model,
    rounds=100,
)
Communication-Efficient Byzantine Robustness
The Compression-Robustness Tension
Problem: Communication compression (top-k sparsification, quantization) can amplify Byzantine attacks.
Example:
- Honest device sends top-1% largest gradients
- Byzantine device sends malicious values in top-1%
- Server can't distinguish (both look sparse)
Byzantine Robust Learning with Compression (Richtárik et al., AISTATS 2024)
Communication compression for Byzantine robust learning[2] provides the first methods that achieve both:
- Communication efficiency (compression)
- Byzantine robustness (malicious device tolerance)
Key algorithms:
- Byz-VR-MARINA: MARINA + Byzantine robustness
- Byz-DASHA-PAGE: PAGE + Byzantine robustness
- Byz-EF21: Error feedback + Byzantine robustness
- Byz-EF21-BC: EF21 with broadcast compression
Technique: Robust aggregation of compressed updates:
- Devices compress gradients (top-k, quantization)
- Server applies robust aggregation (median, trimmed mean, clipping)
- Variance reduction maintains convergence
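The three steps above can be sketched with a top-k compressor followed by a coordinate-wise median. This is a simplified stand-in, not the published Byz-VR-MARINA pseudocode (which additionally uses variance reduction), and all names are illustrative:

```python
# Simplified sketch of "compress, then robustly aggregate" -- not the actual
# Byz-VR-MARINA algorithm, which also applies variance reduction.
import numpy as np

def top_k(grad, k):
    """Keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(grad)
    idx = np.argsort(np.abs(grad))[-k:]
    out[idx] = grad[idx]
    return out

def robust_compressed_aggregate(gradients, k):
    compressed = [top_k(g, k) for g in gradients]
    # Coordinate-wise median tolerates a minority of malicious senders,
    # even when their compressed updates "look sparse" like honest ones.
    return np.median(compressed, axis=0)

rng = np.random.default_rng(1)
honest = [np.array([5.0, -3.0, 0.1, 0.2]) + 0.01 * rng.standard_normal(4)
          for _ in range(4)]
byzantine = np.array([1e6, -1e6, 1e6, -1e6])  # sparse-looking but malicious
agg = robust_compressed_aggregate(honest + [byzantine], k=2)
# agg stays close to the honest gradient [5, -3, 0, 0]
```

The key point is that robustness is enforced after decompression on the server, so the compressor never has to distinguish honest from malicious updates itself.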
# Byzantine-robust + compressed FL
client = octomil.OctomilClient(
    project_id="compressed-robust-fl",
    # Byzantine defense
    byzantine_defense="byz-vr-marina",
    # Communication compression
    compression="top-k",
    sparsity=0.01,  # keep 1% of coordinates
    # Combines robustness + efficiency
    max_byzantine_fraction=0.2,
)

# Achieves:
# - 100× communication compression
# - Robustness to 20% malicious devices
# - Provable convergence
Clipping with Fast Rates
Double Momentum and Error Feedback (Richtárik et al., 2025)
Problem: Gradient clipping (for Byzantine robustness or DP) slows convergence.
Double momentum and error feedback for clipping with fast rates and differential privacy[3] provides:
- Fast convergence despite clipping (O(1/T) instead of O(1/√T))
- Byzantine robustness from clipping
- Differential privacy as a bonus (clipping bounds update sensitivity, the key prerequisite for DP noise addition)
Algorithm (Clip21-SGD2M):
- First momentum: Accelerates convergence
- Second momentum: Variance reduction
- Error feedback: Compensates for clipping bias
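A minimal single-momentum sketch of the clip-with-error-feedback mechanics (the second, variance-reduction momentum and the exact Clip21-SGD2M update rules are omitted; all names here are illustrative):

```python
# Illustrative sketch of clipping + momentum + error feedback.
# Not the published Clip21-SGD2M algorithm; names and defaults are assumed.
import numpy as np

def clip(v, tau):
    """Scale v down so its norm is at most tau."""
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def clipped_momentum_step(w, grad, state, lr=0.1, beta=0.9, tau=1.0):
    # Momentum smooths the stochastic gradient before clipping.
    state["m"] = beta * state["m"] + (1 - beta) * grad
    # Error feedback: re-inject what clipping discarded on earlier steps,
    # so the clipping bias does not accumulate.
    proposal = state["m"] + state["err"]
    update = clip(proposal, tau)
    state["err"] = proposal - update
    return w - lr * update

state = {"m": np.zeros(2), "err": np.zeros(2)}
w = np.array([5.0, 0.0])
w = clipped_momentum_step(w, grad=w.copy(), state=state)  # one step on f(w) = ||w||^2 / 2
```

Every emitted update has norm at most tau, which is what bounds a Byzantine device's per-step influence, while the error term preserves the information clipping would otherwise destroy.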
# Fast Byzantine-robust learning with double momentum
client = octomil.OctomilClient(
    algorithm="clip21-sgd2m",
    # Clipping for robustness
    clip_norm=1.0,
    # Double momentum
    momentum1=0.9,   # standard momentum
    momentum2=0.99,  # variance-reduction momentum
    # Error feedback
    error_feedback=True,
)

# Converges fast despite clipping
Partial Participation Challenges
Byz-VR-MARINA-PP (Richtárik et al., NeurIPS 2024)
Byzantine robustness and partial participation can be achieved simultaneously[4] introduces Byz-VR-MARINA-PP:
Key challenges with partial participation:
- Different devices participate each round
- Can't compare gradients across rounds directly
- Byzantine devices can time attacks when they participate
Solution: Maintain device-specific reference gradients and combine clipping with variance reduction.
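A toy sketch of the idea, with assumed names (not the published Byz-VR-MARINA-PP pseudocode): each device's update is clipped against its own reference gradient, so its per-round influence is bounded no matter when it chooses to participate:

```python
# Toy sketch of device-specific references under partial participation.
# Class and attribute names are illustrative, not an Octomil or paper API.
import numpy as np

def clip(v, tau):
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

class PartialParticipationAggregator:
    def __init__(self, dim, tau=1.0):
        self.tau = tau
        self.refs = {}                 # device_id -> last accepted estimate
        self.global_ref = np.zeros(dim)

    def aggregate(self, round_updates):
        """round_updates: {device_id: gradient} for this round's participants."""
        diffs = []
        for dev, grad in round_updates.items():
            # New devices fall back to the global reference.
            ref = self.refs.get(dev, self.global_ref)
            # Clip the *difference* from this device's own reference.
            diffs.append(clip(grad - ref, self.tau))
            self.refs[dev] = ref + diffs[-1]
        self.global_ref = self.global_ref + np.mean(diffs, axis=0)
        return self.global_ref

agg = PartialParticipationAggregator(dim=2, tau=1.0)
updates = {f"h{i}": np.array([1.0, 0.0]) for i in range(4)}
updates["byz"] = np.array([1e6, 1e6])  # attacker times its participation
out = agg.aggregate(updates)           # attacker's influence is bounded by tau
```

Because each clipped difference has norm at most tau, a Byzantine device that waits for a favorable round still cannot move the model by more than tau per appearance.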
# Byzantine-robust with partial participation
client = octomil.OctomilClient(
    algorithm="byz-vr-marina-pp",
    # Partial participation
    participation_rate=0.05,  # 5% per round
    devices_per_round=100,
    # Byzantine robustness
    max_byzantine_fraction=0.3,  # up to 30% malicious
    # Variance reduction for efficiency
    reference_update_frequency=10,
)
Robust Aggregation Methods
Classic Robust Aggregators
Coordinate-wise median:
- For each coordinate, take the median of that coordinate across all submitted gradients
Pros: Simple; robust as long as honest devices hold a majority
Cons: Coordinate-wise (ignores correlations), slow for high dimensions
Trimmed mean:
- Sort values per coordinate
- Remove top/bottom fraction
- Average remaining
Pros: Faster than median, tunable robustness
Cons: Still coordinate-wise
Krum (and Multi-Krum):
- Compute pairwise distances between all gradients
- Select gradient with smallest sum of distances to nearest neighbors
- Use that gradient (or average of top-k)
Pros: Considers full gradient geometry
Cons: O(n²) complexity, requires knowing f (the number of Byzantine devices)
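Minimal numpy versions of these three aggregators, for illustration (production implementations add batching, tie-breaking, and input validation):

```python
# Minimal implementations of the classic robust aggregators.
import numpy as np

def coordinate_median(grads):
    """Median of each coordinate across all submitted gradients."""
    return np.median(grads, axis=0)

def trimmed_mean(grads, trim_fraction=0.2):
    """Sort per coordinate, drop the top/bottom fraction, average the rest."""
    g = np.sort(np.asarray(grads), axis=0)
    t = int(len(grads) * trim_fraction)
    return g[t:len(grads) - t].mean(axis=0)

def krum(grads, f):
    """Select the gradient with the smallest summed distance
    to its n - f - 2 nearest neighbors."""
    g = np.asarray(grads)
    n = len(g)
    dists = np.linalg.norm(g[:, None, :] - g[None, :, :], axis=-1) ** 2
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(dists[i], i))[: n - f - 2]
        scores.append(nearest.sum())
    return g[int(np.argmin(scores))]

grads = [np.array([1.0, 1.0, 1.0]) + 0.01 * i for i in range(5)]
grads.append(np.array([100.0, 100.0, 100.0]))  # one Byzantine outlier
# All three aggregators return a vector close to the honest gradient [1, 1, 1].
```

Note how Krum is the only one that needs f as an input, which is exactly the knowledge assumption flagged above.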
Octomil's Robust Aggregators
# Choose a robust aggregation method
client = octomil.OctomilClient(
    byzantine_defense="coordinate-median",  # or "trimmed-mean", "krum", "clipping-vr"
    # Method-specific params
    aggregation_config={
        # For trimmed mean
        "trim_fraction": 0.2,  # remove top/bottom 20%
        # For Krum
        "krum_k": 10,  # average the 10 closest gradients
        # For clipping
        "clip_norm": 1.0,  # clip threshold
    },
)
Attacks and Defenses
Known Attack Strategies
A Little Is Enough (LIE) Attack:
- Colluding Byzantine devices apply small, coordinated perturbations that stay within the natural variance of honest gradients
- Evades naive outlier detection while steadily steering the model toward the attacker's objective
Defense: Clipping, robust aggregation
Backdoor Attack:
- Train model to misclassify specific input patterns
- E.g., "Any image with small sticker in corner → classify as attacker's target"
Defense: Anomaly detection, input validation
Model Replacement Attack:
- Byzantine devices replace entire model with malicious version
- Amplified by partial participation (can wait for favorable round)
Defense: Byzantine-robust aggregation with partial participation safeguards
Octomil's Multi-Layer Defense
# Defense in depth
client = octomil.OctomilClient(
    project_id="high-security-fl",
    # Layer 1: Byzantine-robust aggregation
    byzantine_defense="byz-vr-marina",
    max_byzantine_fraction=0.3,
    # Layer 2: Anomaly detection
    anomaly_detection=True,
    anomaly_threshold=3.0,  # standard deviations
    # Layer 3: Reputation system
    device_reputation=True,
    reputation_decay=0.95,  # penalize past bad behavior
    # Layer 4: Certified robustness
    certified_robustness=True,  # provable guarantees
    # Layer 5: Audit logging
    audit_logging=True,
    suspicious_device_alerts=True,
)
When Byzantine Robustness Matters
| Application | Byzantine Risk | Defense Priority |
|---|---|---|
| Public cross-device FL | High (open participation) | Critical |
| Enterprise cross-silo | Low (trusted organizations) | Low |
| Medical federation | Medium (some hospitals may have bugs) | Medium |
| Financial FL | High (adversarial incentives) | Critical |
| IoT sensor networks | High (easy to compromise devices) | Critical |
| Research collaborations | Low (trusted partners) | Low |
Rule of thumb: If any untrusted party can participate, you need Byzantine defenses.
Performance Impact
Overhead of Byzantine Defenses
Computational overhead:
- Clipping: +5-10% per round
- Coordinate median: +20-30% per round
- Krum: +50-100% per round (O(n²) distances)
- Clipping + VR: +10-15% per round
Convergence impact:
- With 0% Byzantine: 5-10% slower convergence (defensive overhead)
- With 20% Byzantine: 2-3× faster convergence vs. no defense (avoids poisoning)
- With 40% Byzantine: May not converge without defense
Tradeoff: Small cost when no attack, massive benefit when attacked.
Real-World Performance
Case Study: Public keyboard prediction FL
Setup: 1M devices, open participation, unknown Byzantine fraction
Results without defense:
- 5% Byzantine devices present
- Model accuracy degraded from 87% to 23% after 100 rounds
- Complete failure
Results with Byz-VR-MARINA:
- Same 5% Byzantine devices
- Model accuracy: 85% (only 2% degradation)
- 10% computational overhead
- Success: Robust to attack
Getting Started
pip install octomil
# Initialize with Byzantine defenses
octomil init secure-project --byzantine-defense clipping-vr
# Train with robustness guarantees
octomil train \
--byzantine-defense byz-vr-marina \
--max-byzantine-fraction 0.3 \
--certified-robustness
# Monitor for attacks
octomil monitor --alert-suspicious-devices
See our Advanced FL Strategies guide for detailed threat models and defense strategies.