
Byzantine-Robust FL: Defending Against Malicious Devices

· 8 min read

Federated learning has an adversary problem.

When training across thousands or millions of devices, you can't trust everyone. Some devices may be:

  • Compromised by malware
  • Malicious (intentionally poisoning the model)
  • Faulty (hardware errors, bugs)
  • Adversarially motivated (competitors, attackers)

A single malicious device uploading carefully crafted gradients can completely destroy model accuracy. Without defenses, federated learning is vulnerable to Byzantine attacks—named after the Byzantine Generals' Problem where some participants may be traitors.

This post explores Byzantine-robust aggregation methods and how Octomil implements defenses against adversarial devices.

The Byzantine Threat Model

What Can An Attacker Do?

Byzantine adversary capabilities:

  1. Arbitrary gradients: Can send any values (not constrained to follow training algorithm)
  2. Collude: Multiple malicious devices can coordinate attacks
  3. Adaptive: Can observe aggregated model and adjust strategy
  4. Sybil attacks: Can create multiple fake identities

Attack examples:

Label flipping attack:

# Malicious device: flip all labels before computing gradients
for x, y in local_data:
    y_flipped = 9 - y  # for MNIST: 0→9, 1→8, etc.
    malicious_gradient = compute_gradient(x, y_flipped)

Model poisoning attack:

# Send gradient that moves model toward backdoor
malicious_gradient = sign(target_gradient) * large_magnitude

Convergence attack:

# Send random noise to prevent convergence
malicious_gradient = random_noise(scale=1000)

The Honest Majority Assumption

Byzantine robustness typically assumes:

  • Total devices: n
  • Byzantine devices: f < n/2 (less than half)
  • Honest devices: n − f > n/2 (more than half)

If honest majority holds, Byzantine-robust algorithms can recover the true gradient and converge correctly.

If honest majority doesn't hold, no algorithm can guarantee correctness: a Byzantine majority can simply simulate a plausible set of honest devices trained on different data, and no aggregator can tell the two groups apart.

Byzantine Robustness Without Coordination

The Challenge in Federated Settings

Centralized Byzantine robustness (classical literature) assumes:

  • All devices participate every round
  • Synchronous execution
  • Server knows f (the number of Byzantine devices)

Federated learning is different:

  • Partial participation: Only subset of devices participate per round
  • Asynchronous: Devices have different speeds
  • Unknown f: Don't know how many malicious devices exist

Byzantine Robustness and Partial Participation (Richtárik et al., NeurIPS 2024)

Byzantine robustness and partial participation can be achieved simultaneously: just clip gradient differences1.

Key insight: instead of clipping absolute gradients, clip the differences:

clipped_diff = clip(g_i - g_ref, τ)

where g_ref is a reference gradient (e.g., from previous round).

Why this works:

  • Honest devices have small gradient differences (smooth optimization)
  • Byzantine devices have large gradient differences (arbitrary values)
  • Clipping removes Byzantine influence while preserving honest signal

Algorithm (simplified):

import numpy as np

def byzantine_robust_aggregate(gradients, reference_grad, clip_norm):
    clipped_diffs = []

    for grad in gradients:
        # Compute difference from reference
        diff = grad - reference_grad

        # Clip to bound magnitude
        norm = np.linalg.norm(diff)
        if norm > clip_norm:
            diff = clip_norm * diff / norm

        clipped_diffs.append(diff)

    # Average clipped differences
    avg_diff = np.mean(clipped_diffs, axis=0)

    # New gradient estimate
    return reference_grad + avg_diff

Theoretical guarantee: Converges to optimum even with f < n/2 Byzantine devices.

import octomil

# Byzantine-robust FL with partial participation
client = octomil.OctomilClient(
    project_id="robust-fl",
    byzantine_defense="clipping-vr",  # Clipping with variance reduction

    # Clipping threshold
    clip_threshold=1.0,  # Auto-tuned by default

    # Partial participation
    participation_rate=0.1,  # 10% per round

    # Assumes up to 30% Byzantine devices
    max_byzantine_fraction=0.3,
)

client.train(
    model=my_model,
    rounds=100,
)

Communication-Efficient Byzantine Robustness

The Compression-Robustness Tension

Problem: Communication compression (top-k sparsification, quantization) can amplify Byzantine attacks.

Example:

  • Honest device sends top-1% largest gradients
  • Byzantine device sends malicious values in top-1%
  • Server can't distinguish (both look sparse)
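
A toy sketch makes this concrete (NumPy only; `topk_sparsify` is an illustrative helper, not part of any library): an honest top-k update and a malicious one carry exactly the same number of nonzeros, so sparsity alone cannot flag the attacker.

```python
import numpy as np

def topk_sparsify(g, k):
    """Keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

rng = np.random.default_rng(0)
d, k = 1000, 10  # dimension, 1% sparsity

honest = topk_sparsify(rng.normal(size=d), k)

# Byzantine device: put huge values on k coordinates of its choosing
malicious = np.zeros(d)
malicious[rng.choice(d, k, replace=False)] = 1e6

# Both updates have exactly k nonzeros: support size cannot tell them apart
print(np.count_nonzero(honest), np.count_nonzero(malicious))  # → 10 10
```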

Byzantine Robust Learning with Compression (Richtárik et al., AISTATS 2024)

Communication compression for Byzantine robust learning2 provides the first methods that achieve both:

  • Communication efficiency (compression)
  • Byzantine robustness (malicious device tolerance)

Key algorithms:

  • Byz-VR-MARINA: MARINA + Byzantine robustness
  • Byz-DASHA-PAGE: PAGE + Byzantine robustness
  • Byz-EF21: Error feedback + Byzantine robustness
  • Byz-EF21-BC: EF21 with broadcast compression

Technique: Robust aggregation of compressed updates:

  1. Devices compress gradients (top-k, quantization)
  2. Server applies robust aggregation (median, trimmed mean, clipping)
  3. Variance reduction maintains convergence

# Byzantine-robust + compressed FL
client = octomil.OctomilClient(
    project_id="compressed-robust-fl",

    # Byzantine defense
    byzantine_defense="byz-vr-marina",

    # Communication compression
    compression="top-k",
    sparsity=0.01,  # 1% sparsity

    # Combines robustness + efficiency
    max_byzantine_fraction=0.2,
)

# Achieves:
# - 100× communication compression
# - Robustness to 20% malicious devices
# - Provable convergence
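
The three-step recipe above can be sketched end to end (illustrative NumPy only: it uses coordinate-wise median as the robust aggregator and omits the variance-reduction machinery of the real Byz-VR-MARINA):

```python
import numpy as np

def topk(g, k):
    # Step 1: device-side compression (keep the k largest-magnitude coordinates)
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def robust_aggregate(updates):
    # Step 2: server-side robust aggregation (coordinate-wise median)
    return np.median(np.stack(updates), axis=0)

rng = np.random.default_rng(1)
true_grad = rng.normal(size=100)

# 8 honest devices send noisy compressed gradients, 2 Byzantine devices send garbage
honest = [topk(true_grad + 0.1 * rng.normal(size=100), k=20) for _ in range(8)]
byzantine = [topk(rng.normal(size=100) * 1e6, k=20) for _ in range(2)]

agg = robust_aggregate(honest + byzantine)

# With only 2 outliers among 10 values per coordinate, the median ignores them
print(float(np.max(np.abs(agg))) < 1e3)  # → True
```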

Clipping with Fast Rates

Double Momentum and Error Feedback (Richtárik et al., 2025)

Problem: Gradient clipping (for Byzantine robustness or DP) slows convergence.

Double momentum and error feedback for clipping with fast rates and differential privacy3 provides:

  • Fast convergence despite clipping (O(1/T) instead of O(1/√T))
  • Byzantine robustness from clipping
  • Differential privacy as bonus (clipping → DP)

Algorithm (Clip21-SGD2M):

  • First momentum: Accelerates convergence
  • Second momentum: Variance reduction
  • Error feedback: Compensates for clipping bias

# Fast Byzantine-robust learning with double momentum
client = octomil.OctomilClient(
    algorithm="clip21-sgd2m",

    # Clipping for robustness
    clip_norm=1.0,

    # Double momentum
    momentum1=0.9,   # Standard momentum
    momentum2=0.99,  # Variance reduction momentum

    # Error feedback
    error_feedback=True,
)

# Converges fast despite clipping
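
As a rough illustration of how the three ingredients fit together (a simplified sketch combining clipping, double momentum, and error feedback; not the exact Clip21-SGD2M update from the paper):

```python
import numpy as np

def clip(v, tau):
    # Scale v so its Euclidean norm is at most tau
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def clipped_step(w, grad, state, lr=0.01, tau=1.0, beta1=0.9, beta2=0.99):
    # First momentum: accelerates convergence
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad
    # Second momentum: variance reduction on top of the first
    state["m2"] = beta2 * state["m2"] + (1 - beta2) * state["m1"]
    # Error feedback: re-inject the part clipping removed last step
    update = state["m2"] + state["err"]
    clipped = clip(update, tau)
    state["err"] = update - clipped  # remember what was cut off
    return w - lr * clipped

# Toy objective f(w) = ||w||^2 / 2, whose gradient is w itself
w = np.array([50.0, -30.0])
state = {"m1": np.zeros(2), "m2": np.zeros(2), "err": np.zeros(2)}
for _ in range(20):
    w = clipped_step(w, grad=w, state=state)

# Clipping cut something off, and error feedback is carrying it forward
print(np.linalg.norm(state["err"]) > 0)  # → True
```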

Partial Participation Challenges

Byz-VR-MARINA-PP (Richtárik et al., NeurIPS 2024)

Byzantine robustness and partial participation can be achieved simultaneously4 introduces Byz-VR-MARINA-PP:

Key challenges with partial participation:

  1. Different devices participate each round
  2. Can't compare gradients across rounds directly
  3. Byzantine devices can time attacks when they participate

Solution: maintain device-specific reference gradients, and combine clipping with variance reduction.

# Byzantine-robust with partial participation
client = octomil.OctomilClient(
    algorithm="byz-vr-marina-pp",

    # Partial participation
    participation_rate=0.05,  # 5% per round
    devices_per_round=100,

    # Byzantine robustness
    max_byzantine_fraction=0.3,  # Up to 30% malicious

    # Variance reduction for efficiency
    reference_update_frequency=10,
)

Robust Aggregation Methods

Classic Robust Aggregators

Coordinate-wise median:

median(g)[j] = median(g_1[j], g_2[j], ..., g_n[j])

Pros: Simple, robust to < 50% Byzantine
Cons: Coordinate-wise (ignores correlations), slow for high dimensions

Trimmed mean:

  1. Sort values per coordinate
  2. Remove top/bottom β fraction
  3. Average remaining

Pros: Faster than median, tunable robustness
Cons: Still coordinate-wise
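
Both coordinate-wise aggregators are a few lines of NumPy (an illustrative sketch, not Octomil's internal implementation):

```python
import numpy as np

def coordinate_median(gradients):
    # Per-coordinate median across devices: robust to < 50% Byzantine
    return np.median(np.stack(gradients), axis=0)

def trimmed_mean(gradients, beta=0.2):
    # Sort each coordinate, drop top and bottom beta fraction, average the rest
    G = np.sort(np.stack(gradients), axis=0)
    n = G.shape[0]
    t = int(beta * n)
    return G[t:n - t].mean(axis=0)

honest = [np.array([1.0, 2.0]) for _ in range(8)]
byzantine = [np.array([1e6, -1e6]) for _ in range(2)]
grads = honest + byzantine

print(coordinate_median(grads))  # → [1. 2.]
print(trimmed_mean(grads))       # → [1. 2.]
```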

Krum (and Multi-Krum):

  1. Compute pairwise distances between all gradients
  2. Select gradient with smallest sum of distances to nearest neighbors
  3. Use that gradient (or average of top-k)

Pros: Considers full gradient geometry
Cons: O(n²) complexity, requires knowing f
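
A minimal NumPy sketch of the selection rule (following the original formulation, each gradient is scored against its n − f − 2 nearest neighbors, so f must be known or bounded):

```python
import numpy as np

def krum(gradients, f):
    # Score each gradient by the summed squared distance to its
    # n - f - 2 nearest neighbors; select the smallest score
    G = np.stack(gradients)
    n = len(G)
    d2 = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)  # O(n^2) distances
    k = n - f - 2  # number of neighbors to score against
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(d2[i], i))[:k]
        scores.append(nearest.sum())
    return G[int(np.argmin(scores))]

honest = [np.array([1.0, 2.0]) + 0.01 * i for i in range(8)]
byzantine = [np.array([1e6, 1e6]) for _ in range(2)]

selected = krum(honest + byzantine, f=2)
print(selected)  # one of the tightly clustered honest gradients
```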

Octomil's Robust Aggregators

# Choose robust aggregation method
client = octomil.OctomilClient(
    byzantine_defense="coordinate-median",  # or "trimmed-mean", "krum", "clipping-vr"

    # Method-specific params
    aggregation_config={
        # For trimmed mean
        "trim_fraction": 0.2,  # Remove top/bottom 20%

        # For Krum
        "krum_k": 10,  # Average top-10 closest gradients

        # For clipping
        "clip_norm": 1.0,  # Clip threshold
    },
)

Attacks and Defenses

Known Attack Strategies

A Little Is Enough (ALIE) Attack:

  • Byzantine devices add small, coordinated perturbations that stay within the natural variance of honest updates
  • Steers the model toward the attacker's objective while evading outlier-based defenses

Defense: Clipping, robust aggregation

Backdoor Attack:

  • Train model to misclassify specific input patterns
  • E.g., "Any image with small sticker in corner → classify as attacker's target"

Defense: Anomaly detection, input validation

Model Replacement Attack:

  • Byzantine devices replace entire model with malicious version
  • Amplified by partial participation (can wait for favorable round)

Defense: Byzantine-robust aggregation with partial participation safeguards

Octomil's Multi-Layer Defense

# Defense in depth
client = octomil.OctomilClient(
    project_id="high-security-fl",

    # Layer 1: Byzantine-robust aggregation
    byzantine_defense="byz-vr-marina",
    max_byzantine_fraction=0.3,

    # Layer 2: Anomaly detection
    anomaly_detection=True,
    anomaly_threshold=3.0,  # Standard deviations

    # Layer 3: Reputation system
    device_reputation=True,
    reputation_decay=0.95,  # Penalize past bad behavior

    # Layer 4: Certified robustness
    certified_robustness=True,  # Provable guarantees

    # Layer 5: Audit logging
    audit_logging=True,
    suspicious_device_alerts=True,
)
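
The anomaly-detection threshold of 3 standard deviations amounts to a z-score test on update norms; a minimal sketch (illustrative only, not Octomil's internal logic):

```python
import numpy as np

def flag_anomalies(updates, threshold=3.0):
    # Flag devices whose update norm is more than `threshold` standard
    # deviations from the mean norm across devices this round
    norms = np.array([np.linalg.norm(u) for u in updates])
    mu, sigma = norms.mean(), norms.std()
    z = (norms - mu) / sigma
    return np.where(np.abs(z) > threshold)[0]

rng = np.random.default_rng(42)
updates = [rng.normal(size=50) for _ in range(20)]  # honest: norm ≈ 7
updates.append(rng.normal(size=50) * 100)           # one outlier: norm ≈ 700

print(flag_anomalies(updates))  # → [20]
```

One caveat: many colluding outliers skew the mean and standard deviation themselves, so median-based statistics (e.g. MAD) hold up better in that regime.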

When Byzantine Robustness Matters

| Application | Byzantine Risk | Defense Priority |
| --- | --- | --- |
| Public cross-device FL | High (open participation) | Critical |
| Enterprise cross-silo | Low (trusted organizations) | Low |
| Medical federation | Medium (some hospitals may have bugs) | Medium |
| Financial FL | High (adversarial incentives) | Critical |
| IoT sensor networks | High (easy to compromise devices) | Critical |
| Research collaborations | Low (trusted partners) | Low |

Rule of thumb: If any untrusted party can participate, you need Byzantine defenses.

Performance Impact

Overhead of Byzantine Defenses

Computational overhead:

  • Clipping: +5-10% per round
  • Coordinate median: +20-30% per round
  • Krum: +50-100% per round (O(n²) distances)
  • Clipping + VR: +10-15% per round

Convergence impact:

  • With 0% Byzantine: 5-10% slower convergence (defensive overhead)
  • With 20% Byzantine: 2-3× faster convergence vs. no defense (avoids poisoning)
  • With 40% Byzantine: May not converge without defense

Tradeoff: Small cost when no attack, massive benefit when attacked.

Real-World Performance

Case Study: Public keyboard prediction FL

Setup: 1M devices, open participation, unknown Byzantine fraction

Results without defense:

  • 5% Byzantine devices present
  • Model accuracy degraded from 87% to 23% after 100 rounds
  • Complete failure

Results with Byz-VR-MARINA:

  • Same 5% Byzantine devices
  • Model accuracy: 85% (only 2% degradation)
  • 10% computational overhead
  • Success: Robust to attack

Getting Started

pip install octomil

# Initialize with Byzantine defenses
octomil init secure-project --byzantine-defense clipping-vr

# Train with robustness guarantees
octomil train \
  --byzantine-defense byz-vr-marina \
  --max-byzantine-fraction 0.3 \
  --certified-robustness

# Monitor for attacks
octomil monitor --alert-suspicious-devices

See our Advanced FL Strategies guide for detailed threat models and defense strategies.


References

Footnotes

  1. Malinovsky, G., Horváth, S., Burlachenko, K., & Richtárik, P. (2024). Byzantine robustness and partial participation can be achieved simultaneously: Just clip gradient differences. NeurIPS 2024. arXiv:2311.14127

  2. Rammal, A., Gruntkowska, K., Fedin, N., Gorbunov, E., & Richtárik, P. (2024). Communication compression for Byzantine robust learning: New efficient algorithms and improved rates. AISTATS 2024. arXiv:2310.09804

  3. Islamov, R., Horváth, S., Lucchi, A., Richtárik, P., & Gorbunov, E. (2025). Double momentum and error feedback for clipping with fast rates and differential privacy. arXiv:2502.11682

  4. Malinovsky, G., Horváth, S., Burlachenko, K., & Richtárik, P. (2024). Byzantine robustness and partial participation can be achieved simultaneously: Just clip gradient differences. NeurIPS 2024. arXiv:2311.14127