
Byzantine-Robust FL: Defending Against Malicious Devices

· 8 min read

Federated learning has an adversary problem.

When training across thousands or millions of devices, you can't trust everyone. Some devices may be:

  • Compromised by malware
  • Malicious (intentionally poisoning the model)
  • Faulty (hardware errors, bugs)
  • Adversarially motivated (competitors, attackers)

A single malicious device uploading carefully crafted gradients can completely destroy model accuracy. Without defenses, federated learning is vulnerable to Byzantine attacks—named after the Byzantine Generals' Problem where some participants may be traitors.

This post explores Byzantine-robust aggregation methods and how Octomil implements defenses against adversarial devices.

The Byzantine Threat Model

What Can An Attacker Do?

Byzantine adversary capabilities:

  1. Arbitrary gradients: Can send any values (not constrained to follow training algorithm)
  2. Collude: Multiple malicious devices can coordinate attacks
  3. Adaptive: Can observe aggregated model and adjust strategy
  4. Sybil attacks: Can create multiple fake identities

Attack examples:

Label flipping attack:

# Malicious device: flip all labels before computing gradients
for x, y in local_data:
    y_flipped = 9 - y  # for MNIST: 0→9, 1→8, etc.
    malicious_gradient = compute_gradient(x, y_flipped)

Model poisoning attack:

# Send gradient that moves model toward backdoor
malicious_gradient = sign(target_gradient) * large_magnitude

Convergence attack:

# Send random noise to prevent convergence
malicious_gradient = random_noise(scale=1000)

The Honest Majority Assumption

Byzantine robustness typically assumes:

  • Total devices: n
  • Byzantine devices: f < n/2 (less than half)
  • Honest devices: n − f > n/2 (more than half)

If honest majority holds, Byzantine-robust algorithms can recover the true gradient and converge correctly.

If honest majority doesn't hold, no algorithm can guarantee correctness: a Byzantine majority can simply simulate a plausible set of honest devices trained on different data, and no aggregator can tell the two groups apart.

Byzantine Robustness Without Coordination

The Challenge in Federated Settings

Centralized Byzantine robustness (classical literature) assumes:

  • All devices participate every round
  • Synchronous execution
  • Server knows f (the number of Byzantine devices)

Federated learning is different:

  • Partial participation: Only subset of devices participate per round
  • Asynchronous: Devices have different speeds
  • Unknown f: Don't know how many malicious devices exist

Byzantine Robustness and Partial Participation (Richtárik et al., NeurIPS 2024)

Byzantine robustness and partial participation can be achieved simultaneously: just clip gradient differences1.

Key insight: instead of clipping absolute gradients, clip the differences:

clipped_diff = clip(g_i - g_ref, τ)

where g_ref is a reference gradient (e.g., from previous round).

Why this works:

  • Honest devices have small gradient differences (smooth optimization)
  • Byzantine devices have large gradient differences (arbitrary values)
  • Clipping removes Byzantine influence while preserving honest signal

Algorithm (simplified):

import numpy as np

def byzantine_robust_aggregate(gradients, reference_grad, clip_norm):
    clipped_diffs = []

    for grad in gradients:
        # Compute difference from reference
        diff = grad - reference_grad

        # Clip to bound magnitude
        norm = np.linalg.norm(diff)
        if norm > clip_norm:
            diff = clip_norm * diff / norm

        clipped_diffs.append(diff)

    # Average clipped differences
    avg_diff = np.mean(clipped_diffs, axis=0)

    # New gradient estimate
    return reference_grad + avg_diff

Theoretical guarantee: Converges to optimum even with f < n/2 Byzantine devices.

import octomil

# Byzantine-robust FL with partial participation
client = octomil.OctomilClient(
    project_id="robust-fl",
    byzantine_defense="clipping-vr",  # Clipping with variance reduction

    # Clipping threshold
    clip_threshold=1.0,  # Auto-tuned by default

    # Partial participation
    participation_rate=0.1,  # 10% per round

    # Assumes up to 30% Byzantine devices
    max_byzantine_fraction=0.3,
)

client.train(
    model=my_model,
    rounds=100,
)

Communication-Efficient Byzantine Robustness

The Compression-Robustness Tension

Problem: Communication compression (top-k sparsification, quantization) can amplify Byzantine attacks.

Example:

  • Honest device sends top-1% largest gradients
  • Byzantine device sends malicious values in top-1%
  • Server can't distinguish (both look sparse)
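
A toy sketch makes this concrete (NumPy only; `topk_sparsify` is an illustrative helper, not part of any library): an honest top-k update and a malicious one carry exactly the same number of nonzeros, so sparsity alone cannot flag the attacker.

```python
import numpy as np

def topk_sparsify(g, k):
    """Keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

rng = np.random.default_rng(0)
d, k = 1000, 10  # dimension, 1% sparsity

honest = topk_sparsify(rng.normal(size=d), k)

# Byzantine device: put huge values on k coordinates of its choosing
malicious = np.zeros(d)
malicious[rng.choice(d, k, replace=False)] = 1e6

# Both updates have exactly k nonzeros: support size cannot tell them apart
print(np.count_nonzero(honest), np.count_nonzero(malicious))  # → 10 10
```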

Byzantine Robust Learning with Compression (Richtárik et al., AISTATS 2024)

Communication compression for Byzantine robust learning2 provides the first methods that achieve both:

  • Communication efficiency (compression)
  • Byzantine robustness (malicious device tolerance)

Key algorithms:

  • Byz-VR-MARINA: MARINA + Byzantine robustness
  • Byz-DASHA-PAGE: PAGE + Byzantine robustness
  • Byz-EF21: Error feedback + Byzantine robustness
  • Byz-EF21-BC: EF21 with broadcast compression

Technique: Robust aggregation of compressed updates:

  1. Devices compress gradients (top-k, quantization)
  2. Server applies robust aggregation (median, trimmed mean, clipping)
  3. Variance reduction maintains convergence

# Byzantine-robust + compressed FL
client = octomil.OctomilClient(
    project_id="compressed-robust-fl",

    # Byzantine defense
    byzantine_defense="byz-vr-marina",

    # Communication compression
    compression="top-k",
    sparsity=0.01,  # 1% sparsity

    # Combines robustness + efficiency
    max_byzantine_fraction=0.2,
)

# Achieves:
# - 100× communication compression
# - Robustness to 20% malicious devices
# - Provable convergence
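
The three-step recipe above can be sketched end to end (illustrative NumPy only: it uses coordinate-wise median as the robust aggregator and omits the variance-reduction machinery of the real Byz-VR-MARINA):

```python
import numpy as np

def topk(g, k):
    # Step 1: device-side compression (keep the k largest-magnitude coordinates)
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def robust_aggregate(updates):
    # Step 2: server-side robust aggregation (coordinate-wise median)
    return np.median(np.stack(updates), axis=0)

rng = np.random.default_rng(1)
true_grad = rng.normal(size=100)

# 8 honest devices send noisy compressed gradients, 2 Byzantine devices send garbage
honest = [topk(true_grad + 0.1 * rng.normal(size=100), k=20) for _ in range(8)]
byzantine = [topk(rng.normal(size=100) * 1e6, k=20) for _ in range(2)]

agg = robust_aggregate(honest + byzantine)

# With only 2 outliers among 10 values per coordinate, the median ignores them
print(float(np.max(np.abs(agg))) < 1e3)  # → True
```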

Clipping with Fast Rates

Double Momentum and Error Feedback (Richtárik et al., 2025)

Problem: Gradient clipping (for Byzantine robustness or DP) slows convergence.

Double momentum and error feedback for clipping with fast rates and differential privacy3 provides:

  • Fast convergence despite clipping (O(1/T) instead of O(1/√T))
  • Byzantine robustness from clipping
  • Differential privacy as bonus (clipping → DP)

Algorithm (Clip21-SGD2M):

  • First momentum: Accelerates convergence
  • Second momentum: Variance reduction
  • Error feedback: Compensates for clipping bias

# Fast Byzantine-robust learning with double momentum
client = octomil.OctomilClient(
    algorithm="clip21-sgd2m",

    # Clipping for robustness
    clip_norm=1.0,

    # Double momentum
    momentum1=0.9,   # Standard momentum
    momentum2=0.99,  # Variance reduction momentum

    # Error feedback
    error_feedback=True,
)

# Converges fast despite clipping
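
As a rough illustration of how the three ingredients fit together (a simplified sketch combining clipping, double momentum, and error feedback; not the exact Clip21-SGD2M update from the paper):

```python
import numpy as np

def clip(v, tau):
    # Scale v so its Euclidean norm is at most tau
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def clipped_step(w, grad, state, lr=0.01, tau=1.0, beta1=0.9, beta2=0.99):
    # First momentum: accelerates convergence
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad
    # Second momentum: variance reduction on top of the first
    state["m2"] = beta2 * state["m2"] + (1 - beta2) * state["m1"]
    # Error feedback: re-inject the part clipping removed last step
    update = state["m2"] + state["err"]
    clipped = clip(update, tau)
    state["err"] = update - clipped  # remember what was cut off
    return w - lr * clipped

# Toy objective f(w) = ||w||^2 / 2, whose gradient is w itself
w = np.array([50.0, -30.0])
state = {"m1": np.zeros(2), "m2": np.zeros(2), "err": np.zeros(2)}
for _ in range(20):
    w = clipped_step(w, grad=w, state=state)

# Clipping cut something off, and error feedback is carrying it forward
print(np.linalg.norm(state["err"]) > 0)  # → True
```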

Partial Participation Challenges

Byz-VR-MARINA-PP (Richtárik et al., NeurIPS 2024)

Byzantine robustness and partial participation can be achieved simultaneously4 introduces Byz-VR-MARINA-PP:

Key challenges with partial participation:

  1. Different devices participate each round
  2. Can't compare gradients across rounds directly
  3. Byzantine devices can time attacks when they participate

Solution: maintain device-specific reference gradients, and combine clipping with variance reduction.

# Byzantine-robust with partial participation
client = octomil.OctomilClient(
    algorithm="byz-vr-marina-pp",

    # Partial participation
    participation_rate=0.05,  # 5% per round
    devices_per_round=100,

    # Byzantine robustness
    max_byzantine_fraction=0.3,  # Up to 30% malicious

    # Variance reduction for efficiency
    reference_update_frequency=10,
)

Robust Aggregation Methods

Classic Robust Aggregators

Coordinate-wise median:

median(g)[j] = median(g_1[j], g_2[j], ..., g_n[j])

Pros: Simple, robust to < 50% Byzantine
Cons: Coordinate-wise (ignores correlations), slow for high dimensions

Trimmed mean:

  1. Sort values per coordinate
  2. Remove top/bottom β fraction
  3. Average remaining

Pros: Faster than median, tunable robustness
Cons: Still coordinate-wise
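
Both coordinate-wise aggregators are a few lines of NumPy (an illustrative sketch, not Octomil's internal implementation):

```python
import numpy as np

def coordinate_median(gradients):
    # Per-coordinate median across devices: robust to < 50% Byzantine
    return np.median(np.stack(gradients), axis=0)

def trimmed_mean(gradients, beta=0.2):
    # Sort each coordinate, drop top and bottom beta fraction, average the rest
    G = np.sort(np.stack(gradients), axis=0)
    n = G.shape[0]
    t = int(beta * n)
    return G[t:n - t].mean(axis=0)

honest = [np.array([1.0, 2.0]) for _ in range(8)]
byzantine = [np.array([1e6, -1e6]) for _ in range(2)]
grads = honest + byzantine

print(coordinate_median(grads))  # → [1. 2.]
print(trimmed_mean(grads))       # → [1. 2.]
```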

Krum (and Multi-Krum):

  1. Compute pairwise distances between all gradients
  2. Select gradient with smallest sum of distances to nearest neighbors
  3. Use that gradient (or average of top-k)

Pros: Considers full gradient geometry
Cons: O(n²) complexity, requires knowing f
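
A minimal NumPy sketch of the selection rule (following the original formulation, each gradient is scored against its n − f − 2 nearest neighbors, so f must be known or bounded):

```python
import numpy as np

def krum(gradients, f):
    # Score each gradient by the summed squared distance to its
    # n - f - 2 nearest neighbors; select the smallest score
    G = np.stack(gradients)
    n = len(G)
    d2 = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)  # O(n^2) distances
    k = n - f - 2  # number of neighbors to score against
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(d2[i], i))[:k]
        scores.append(nearest.sum())
    return G[int(np.argmin(scores))]

honest = [np.array([1.0, 2.0]) + 0.01 * i for i in range(8)]
byzantine = [np.array([1e6, 1e6]) for _ in range(2)]

selected = krum(honest + byzantine, f=2)
print(selected)  # one of the tightly clustered honest gradients
```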

Octomil's Robust Aggregators

# Choose robust aggregation method
client = octomil.OctomilClient(
    byzantine_defense="coordinate-median",  # or "trimmed-mean", "krum", "clipping-vr"

    # Method-specific params
    aggregation_config={
        # For trimmed mean
        "trim_fraction": 0.2,  # Remove top/bottom 20%

        # For Krum
        "krum_k": 10,  # Average top-10 closest gradients

        # For clipping
        "clip_norm": 1.0,  # Clip threshold
    },
)

Attacks and Defenses

Known Attack Strategies

A Little Is Enough (ALIE) Attack:

  • Byzantine devices add small, coordinated perturbations that stay within the natural variance of honest updates
  • Steers the model toward the attacker's objective while evading outlier-based defenses

Defense: Clipping, robust aggregation

Backdoor Attack:

  • Train model to misclassify specific input patterns
  • E.g., "Any image with small sticker in corner → classify as attacker's target"

Defense: Anomaly detection, input validation

Model Replacement Attack:

  • Byzantine devices replace entire model with malicious version
  • Amplified by partial participation (can wait for favorable round)

Defense: Byzantine-robust aggregation with partial participation safeguards

Octomil's Multi-Layer Defense

# Defense in depth
client = octomil.OctomilClient(
    project_id="high-security-fl",

    # Layer 1: Byzantine-robust aggregation
    byzantine_defense="byz-vr-marina",
    max_byzantine_fraction=0.3,

    # Layer 2: Anomaly detection
    anomaly_detection=True,
    anomaly_threshold=3.0,  # Standard deviations

    # Layer 3: Reputation system
    device_reputation=True,
    reputation_decay=0.95,  # Penalize past bad behavior

    # Layer 4: Certified robustness
    certified_robustness=True,  # Provable guarantees

    # Layer 5: Audit logging
    audit_logging=True,
    suspicious_device_alerts=True,
)
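
The anomaly-detection threshold of 3 standard deviations amounts to a z-score test on update norms; a minimal sketch (illustrative only, not Octomil's internal logic):

```python
import numpy as np

def flag_anomalies(updates, threshold=3.0):
    # Flag devices whose update norm is more than `threshold` standard
    # deviations from the mean norm across devices this round
    norms = np.array([np.linalg.norm(u) for u in updates])
    mu, sigma = norms.mean(), norms.std()
    z = (norms - mu) / sigma
    return np.where(np.abs(z) > threshold)[0]

rng = np.random.default_rng(42)
updates = [rng.normal(size=50) for _ in range(20)]  # honest: norm ≈ 7
updates.append(rng.normal(size=50) * 100)           # one outlier: norm ≈ 700

print(flag_anomalies(updates))  # → [20]
```

One caveat: many colluding outliers skew the mean and standard deviation themselves, so median-based statistics (e.g. MAD) hold up better in that regime.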

When Byzantine Robustness Matters

| Application | Byzantine Risk | Defense Priority |
| --- | --- | --- |
| Public cross-device FL | High (open participation) | Critical |
| Enterprise cross-silo | Low (trusted organizations) | Low |
| Medical federation | Medium (some hospitals may have bugs) | Medium |
| Financial FL | High (adversarial incentives) | Critical |
| IoT sensor networks | High (easy to compromise devices) | Critical |
| Research collaborations | Low (trusted partners) | Low |

Rule of thumb: If any untrusted party can participate, you need Byzantine defenses.

Performance Impact

Overhead of Byzantine Defenses

Computational overhead:

  • Clipping: +5-10% per round
  • Coordinate median: +20-30% per round
  • Krum: +50-100% per round (O(n²) distances)
  • Clipping + VR: +10-15% per round

Convergence impact:

  • With 0% Byzantine: 5-10% slower convergence (defensive overhead)
  • With 20% Byzantine: 2-3× faster convergence vs. no defense (avoids poisoning)
  • With 40% Byzantine: May not converge without defense

Tradeoff: Small cost when no attack, massive benefit when attacked.

Real-World Performance

Case Study: Public keyboard prediction FL

Setup: 1M devices, open participation, unknown Byzantine fraction

Results without defense:

  • 5% Byzantine devices present
  • Model accuracy degraded from 87% to 23% after 100 rounds
  • Complete failure

Results with Byz-VR-MARINA:

  • Same 5% Byzantine devices
  • Model accuracy: 85% (only 2% degradation)
  • 10% computational overhead
  • Success: Robust to attack

Getting Started

pip install octomil

# Initialize with Byzantine defenses
octomil init secure-project --byzantine-defense clipping-vr

# Train with robustness guarantees
octomil train \
  --byzantine-defense byz-vr-marina \
  --max-byzantine-fraction 0.3 \
  --certified-robustness

# Monitor for attacks
octomil monitor --alert-suspicious-devices

See our Advanced FL Strategies guide for detailed threat models and defense strategies.


References

Footnotes

  1. Malinovsky, G., Horváth, S., Burlachenko, K., & Richtárik, P. (2024). Byzantine robustness and partial participation can be achieved simultaneously: Just clip gradient differences. NeurIPS 2024. arXiv:2311.14127

  2. Rammal, A., Gruntkowska, K., Fedin, N., Gorbunov, E., & Richtárik, P. (2024). Communication compression for Byzantine robust learning: New efficient algorithms and improved rates. AISTATS 2024. arXiv:2310.09804

  3. Islamov, R., Horváth, S., Lucchi, A., Richtárik, P., & Gorbunov, E. (2025). Double momentum and error feedback for clipping with fast rates and differential privacy. arXiv:2502.11682

  4. Malinovsky, G., Horváth, S., Burlachenko, K., & Richtárik, P. (2024). Byzantine robustness and partial participation can be achieved simultaneously: Just clip gradient differences. NeurIPS 2024. arXiv:2311.14127