
Beyond Cross-Entropy: Federated AUC Maximization and X-Risk Optimization


The default objective in machine learning is minimizing cross-entropy loss. It's what PyTorch uses out-of-the-box. It's what most federated learning papers optimize. It's simple, well-understood, and works great for balanced classification problems.

But real-world applications rarely have balanced classes or standard objectives.

  • Medical diagnosis: 1% of patients have disease → 99% accuracy means predicting "no disease" for everyone
  • Fraud detection: 0.1% of transactions are fraudulent → standard loss fails
  • Financial risk: We care about tail risk (the 1% worst-case scenarios), not average loss
  • Imbalanced federated data: Some devices have only positive examples, others only negatives

This post explores specialized optimization objectives for federated learning and how Octomil supports them through recent breakthroughs from Guo, Yang, and collaborators.

The Problem with Cross-Entropy

When Standard Loss Fails

Example: Cancer detection from medical images

Dataset:

  • 10,000 patients total
  • 100 have cancer (1%)
  • 9,900 are healthy (99%)

Model trained with cross-entropy:

# Standard training
model = train_classifier(data, loss="cross_entropy")
predictions = model.predict(test_data)

# Results:
# Accuracy: 99.0% ✓ (looks great!)
# But actually: the model predicts "no cancer" for everyone
# True positive rate: 0% ✗ (catastrophic failure)

Why it fails: Cross-entropy minimizes the average loss over all examples. With 99% negatives, a model can make the average loss small by always predicting negative, so that is where training converges.

What We Actually Care About

For imbalanced problems, we need metrics that focus on ranking quality:

AUC (Area Under the ROC Curve):

  • Measures how well the model ranks positives above negatives
  • AUC = 1.0: Perfect ranking (all positives scored higher than negatives)
  • AUC = 0.5: Random ranking
  • Key property: Invariant to class imbalance

X-Risk (Tail risk measures):

  • Focus on worst-case scenarios (e.g., 95th percentile loss)
  • Critical for safety applications, financial risk management
  • Standard ML minimizes the expected loss; X-risk objectives instead target tail measures such as Conditional Value at Risk (CVaR)
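
As a concrete reference point, here is a minimal sketch of empirical CVaR, the tail measure used throughout this post (the helper name `cvar` is ours, not an Octomil API):

```python
import numpy as np

def cvar(losses, alpha=0.95):
    """Empirical CVaR: the mean of the worst (1 - alpha) fraction of losses."""
    losses = np.sort(np.asarray(losses))
    tail = losses[int(np.ceil(alpha * len(losses))):]
    return float(tail.mean()) if len(tail) else float(losses[-1])

samples = np.random.default_rng(0).lognormal(size=10_000)
print(cvar(samples, alpha=0.95))  # average of the worst 5% of outcomes
```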

Federated AUC Maximization

The Challenge

Naive approach: each device optimizes its local AUC, and the server averages the resulting models.

Problem: AUC requires comparing all pairs of positive and negative examples:

$$\text{AUC} = \frac{1}{n_+ n_-} \sum_{i \in P} \sum_{j \in N} \mathbb{1}\left[f(x_i) > f(x_j)\right]$$
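
To make the pairwise definition concrete, here is a direct (if O(n₊·n₋)) computation; `pairwise_auc` is an illustrative helper, not part of any library:

```python
import numpy as np

def pairwise_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    diff = scores_pos[:, None] - scores_neg[None, :]  # all n+ * n- pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())  # ties count 1/2

pos = np.array([0.9, 0.7, 0.4])
neg = np.array([0.8, 0.3, 0.2, 0.1])
print(pairwise_auc(pos, neg))  # 10 of 12 pairs ranked correctly ≈ 0.833
```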

In federated learning:

  • Device A has only positive examples
  • Device B has only negative examples
  • Cannot compute local AUC (need examples from both classes)

FeDXL: Deep X-Risk Optimization (Guo et al., ICML 2023)

FeDXL [1] provides the first provable federated learning method for deep X-risk optimization, including AUC maximization.

Key insight: Reformulate AUC as a saddle-point problem that can be decomposed across devices.

AUC reformulation:

$$\max_w \min_{a,b} \; \mathbb{E}_{(x^+, x^-)} \Big[\big(1 - (f_w(x^+) - f_w(x^-) - a)\big)_+ + b\,\big(f_w(x^+) - f_w(x^-) - a\big)\Big]$$

This can be optimized locally even when devices have only one class!
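
FeDXL's actual estimator is more involved, but the decomposition idea is easiest to see in the classical square-loss min-max reformulation of AUC (Ying et al., NeurIPS 2016), in which every term depends on a single labeled example, so a device holding only one class can still contribute an unbiased estimate:

```python
import numpy as np

def minmax_auc_term(f_x, y, a, b, alpha, p):
    """Per-example term of the square-loss min-max AUC surrogate
    (Ying et al., 2016); p is the positive-class prior. Averaging these
    terms over any batch -- even a single-class one -- estimates the
    saddle-point objective (up to an additive constant), so no
    positive/negative pairs are ever needed."""
    pos, neg = float(y == 1), float(y == -1)
    return ((1 - p) * (f_x - a) ** 2 * pos
            + p * (f_x - b) ** 2 * neg
            + 2 * (1 + alpha) * (p * f_x * neg - (1 - p) * f_x * pos)
            - p * (1 - p) * alpha ** 2)

# A device with only negatives (y = -1) can still evaluate its share:
print(minmax_auc_term(f_x=0.2, y=-1, a=0.8, b=0.1, alpha=-0.6, p=0.01))
```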

Provable guarantees:

  • Convergence rate: O(1/√T) for non-convex deep networks
  • Communication complexity: Same as FedAvg (no overhead for AUC objective)
  • Works with heterogeneous data (different class distributions per device)
import octomil

# Federated AUC maximization in Octomil
client = octomil.OctomilClient(
    project_id="cancer-detection-fl",
    objective="auc",       # AUC maximization instead of cross-entropy
    imbalance_ratio=0.01,  # 1% positive class
)

client.train(
    model=medical_image_model,
    rounds=100,
)

# Octomil handles:
# - Local AUC optimization (even with a single class per device)
# - Global AUC aggregation
# - Maintains provable convergence (FeDXL algorithm)

Communication-Efficient Federated AUC (Guo et al., TMLR 2026)

Problem: Standard federated AUC requires communicating model updates every round.

Communication-Efficient Federated AUC Maximization with Cyclic Client Participation [2] achieves constant communication complexity per device:

Key technique: Cyclic participation

  • Divide devices into K cohorts
  • Each cohort participates once every K rounds
  • Each device's total communication: O(1) (independent of total rounds!)

Result: train a federated AUC model for 100 rounds while each device communicates only 5 times (a 20× reduction).
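
A toy illustration of the cohort schedule (our own sketch of the bookkeeping, not the paper's algorithm):

```python
# Devices are split into K cohorts; cohort (r mod K) participates in round r.
K, total_rounds = 20, 100

def participates(device_id, round_idx, k=K):
    return device_id % k == round_idx % k

uploads = sum(participates(7, r) for r in range(total_rounds))
print(uploads)  # 5 uploads over 100 rounds: 20x fewer than full participation
```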

# Communication-efficient federated AUC
client = octomil.OctomilClient(
    project_id="fraud-detection-fl",
    objective="auc",
    participation_strategy="cyclic",  # Cyclic cohorts
    cohort_count=20,                  # Each device participates in 1/20 of rounds
)

# Total rounds: 100
# Communication per device: 5 updates (20× reduction)
# AUC performance: same as full participation

Distributionally Robust Optimization (DRO)

The Distribution Shift Problem

  • Training distribution: Data from 2020–2023
  • Deployment distribution: Data from 2024 (COVID aftermath, economic changes)

Standard ML optimizes for average training performance, which fails under distribution shift.

Distributionally Robust Optimization optimizes for worst-case performance over a set of plausible distributions.

Communication-Efficient Federated Group DRO (Guo & Yang, NeurIPS 2024)

Federated Group Distributionally Robust Optimization [3] addresses the hardest federated setting:

Challenges:

  1. Data heterogeneity: Each device has different distribution
  2. Group fairness: Ensure good performance for all demographic groups
  3. Communication efficiency: DRO typically requires expensive second-order methods

Key contributions:

  • First communication-efficient algorithm for federated group DRO
  • Handles non-convex objectives (deep networks)
  • Provable convergence with O(ε⁻⁴) communication complexity (optimal)

Group DRO objective:

$$\min_w \max_{q \in \Delta_K} \sum_{k=1}^K q_k\, \mathcal{L}_k(w)$$

where $\mathcal{L}_k(w)$ is the loss on group $k$ (e.g., a demographic group or a device cluster) and $q$ ranges over the probability simplex $\Delta_K$.
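
The inner maximization has a simple stochastic approximation: exponentiated-gradient ascent on the group weights (in the spirit of Sagawa et al.'s group DRO; a sketch, not the paper's communication-efficient algorithm):

```python
import numpy as np

def update_group_weights(q, group_losses, eta=0.1):
    """One exponentiated-gradient step of the inner max over the simplex:
    upweight groups that currently have high loss, then renormalize."""
    q = q * np.exp(eta * np.asarray(group_losses))
    return q / q.sum()

q = np.ones(3) / 3                 # start uniform over K = 3 groups
losses = [0.8, 0.3, 0.5]           # current per-group losses L_k(w)
q = update_group_weights(q, losses)
robust_loss = float(np.dot(q, losses))  # minimize this w.r.t. w next
print(q, robust_loss)
```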

# Federated Group DRO in Octomil
client = octomil.OctomilClient(
    project_id="fair-lending-model",
    objective="group-dro",

    # Define groups (can be demographic, geographic, etc.)
    groups={
        "group_1": device_ids_1,  # e.g., young borrowers
        "group_2": device_ids_2,  # e.g., older borrowers
        "group_3": device_ids_3,  # e.g., small business
    },

    # Optimize for worst-group performance
    robustness_parameter=0.1,  # CVaR at the 90% level
)

client.train(
    model=credit_risk_model,
    rounds=50,
)

# Result: the model performs well even on the worst-performing group

X-Risk Optimization for Safety-Critical Applications

Compositional Training for AUC (Yuan et al., ICLR 2022)

Problem: Deep AUC maximization is unstable with standard optimizers.

Compositional Training for End-to-End Deep AUC Maximization [4] (with Guo as a co-author) provides stable training:

Key technique: Compositional optimization

  • Decompose AUC objective into inner and outer optimization problems
  • Inner problem: Update auxiliary variables
  • Outer problem: Update model weights

Result: stable end-to-end training of deep networks with the AUC objective.
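
To see why the inner/outer split helps, here is a generic moving-average compositional SGD sketch on a toy objective (in the spirit of two-timescale compositional methods, not the exact algorithm of Yuan et al.):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy compositional objective: L(w) = f(E_xi[g(w; xi)]) with
# g(w; xi) = (w - xi)^2 and f(u) = sqrt(u). A naive stochastic gradient
# of f(g(w; xi)) is biased because f is nonlinear in the expectation.
g = lambda w, xi: (w - xi) ** 2
dg_dw = lambda w, xi: 2.0 * (w - xi)
df_du = lambda u: 0.5 / np.sqrt(u + 1e-8)

w, u = 5.0, 1.0          # model weight, auxiliary estimate of E[g]
eta, gamma = 0.05, 0.1   # outer / inner step sizes (two timescales)

for _ in range(5000):
    xi = rng.normal()
    u = (1 - gamma) * u + gamma * g(w, xi)  # inner: track E[g] via moving average
    w -= eta * df_du(u) * dg_dw(w, xi)      # outer: chain rule through tracked u

print(f"w = {w:.3f}  (the minimizer of E[(w - xi)^2] is w = 0)")
```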

# Compositional AUC training
client = octomil.OctomilClient(
    project_id="medical-diagnosis",
    objective="auc",
    training_method="compositional",  # Stable compositional training

    # AUC-specific hyperparameters
    auc_margin=1.0,        # Margin for the AUC loss
    adaptive_margin=True,  # Adapt the margin during training
)

Deep Distributionally Robust Optimization (Guo et al., NeurIPS 2021)

An Online Method for Deep Distributionally Robust Optimization [5] enables real-time adaptation to distribution shift:

Setting: Online federated learning where data distribution changes over time

Approach:

  • Maintain uncertainty set around current distribution
  • Optimize for worst-case within uncertainty set
  • Update uncertainty set as new data arrives
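
One minimal way to picture the worst-case step is entropic tilting of the empirical batch weights toward high-loss samples (a heuristic stand-in for the paper's uncertainty-set construction; `dro_batch_loss` and its `temperature` knob are our own):

```python
import numpy as np

def dro_batch_loss(sample_losses, temperature=0.1):
    """Pessimistic batch loss: reweight samples by exp(loss / temperature),
    an entropic tilt toward the worst cases within the batch."""
    losses = np.asarray(sample_losses, dtype=float)
    w = np.exp((losses - losses.max()) / temperature)  # max-shift for stability
    w /= w.sum()
    return float(np.dot(w, losses))

print(dro_batch_loss([0.1, 0.2, 2.0]))  # ≈ 2.0: dominated by the worst sample
```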
# Online DRO for time-varying distributions
client = octomil.OctomilClient(
    project_id="stock-prediction-fl",
    objective="dro",

    # Online setting
    online=True,
    uncertainty_radius=0.1,  # Wasserstein ball radius
    adaptation_rate=0.01,    # How quickly to adapt to new distributions
)

# As the data distribution changes, the model adapts robustly
for round_data in streaming_data:
    client.train_online(round_data)

Adaptive Optimization for Non-Standard Objectives

Unified Convergence Analysis for Adaptive Optimization (Guo et al., Machine Learning Journal)

Problem: Adam, AdaGrad, and other adaptive optimizers lack convergence guarantees for AUC and X-risk objectives.

Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator [6] provides the first rigorous analysis.

Key contributions:

  • Convergence guarantees for Adam-family optimizers with AUC/X-risk objectives
  • Unified framework covering SGD, Adam, AMSGrad, etc.
  • Practical guidelines for learning rate schedules
# Adaptive optimization for AUC
client = octomil.OctomilClient(
    project_id="auc-optimization",
    objective="auc",

    # Adaptive optimizer with guarantees
    optimizer="adam",
    learning_rate="adaptive",  # Auto-tuned per Guo et al.'s theory

    # Moving-average parameters
    beta1=0.9,    # First moment
    beta2=0.999,  # Second moment (standard Adam defaults)
)

When to Use Each Objective

| Objective | Use When | Example Applications |
| --- | --- | --- |
| Cross-Entropy | Balanced classes, standard classification | Image classification, sentiment analysis |
| AUC | Imbalanced classes, ranking quality matters | Medical diagnosis, fraud detection, anomaly detection |
| Group DRO | Fairness across groups, distribution shift | Fair lending, demographic parity, multi-region deployment |
| CVaR (X-Risk) | Safety-critical, tail risk matters | Financial risk, autonomous vehicles, medical decisions |
| Compositional AUC | Deep networks + imbalanced data | Deep medical imaging, large-scale anomaly detection |

Octomil's Specialized Objectives Framework

import octomil

# Unified API for specialized objectives
client = octomil.OctomilClient(
    project_id="specialized-objective",

    # Choose objective
    objective="auc",  # or "group-dro", "cvar", "compositional-auc"

    # Objective-specific configuration
    objective_config={
        # For AUC
        "margin": 1.0,
        "imbalance_ratio": 0.01,

        # For Group DRO
        "groups": group_definitions,
        "robustness_level": 0.9,  # 90% CVaR

        # For X-Risk
        "risk_level": 0.95,      # 95th percentile
        "risk_measure": "cvar",  # or "var", "entropic"
    },

    # Communication efficiency
    participation_strategy="cyclic",  # For constant communication

    # Optimizer
    optimizer="adam",
    adaptive_optimization=True,  # Use Guo et al.'s adaptive theory
)

# Train with the specialized objective
client.train(
    model=my_model,
    rounds=100,
)

# Evaluate with appropriate metrics
metrics = client.evaluate(
    test_data=test_data,
    metrics=["auc", "accuracy", "worst_group_loss", "cvar_95"],
)

print(f"AUC: {metrics.auc}")
print(f"Worst group performance: {metrics.worst_group_loss}")
print(f"CVaR (95%): {metrics.cvar_95}")

Real-World Impact

Case Study 1: Medical Imaging (Cancer Detection)

Challenge: 0.5% positive rate (highly imbalanced)

Solution: Federated AUC maximization

  • 50 hospitals, each with 10K images
  • Standard FL (cross-entropy): 99.5% accuracy, 0% sensitivity (useless)
  • FeDXL (AUC): 99.3% accuracy, 87% sensitivity (clinically useful)

Key insight: AUC optimization found cancer cases despite extreme imbalance.

Case Study 2: Financial Fraud Detection

Challenge: 0.1% fraud rate, distribution shift over time

Solution: Online DRO with cyclic participation

  • 1M devices (mobile banking apps)
  • Cyclic participation: Each device communicates 5 times (vs. 100 in standard FL)
  • DRO: Robust to distribution shift (new fraud patterns)
  • Result: 95% fraud detection rate with 20× communication reduction

Case Study 3: Fair Credit Scoring

Challenge: Ensure fairness across demographic groups

Solution: Federated Group DRO

  • 3 demographic groups with different default rates
  • Standard FL: 85% average accuracy, 65% worst-group accuracy (unfair)
  • Group DRO: 82% average accuracy, 79% worst-group accuracy (much fairer)

Key insight: Small average accuracy drop for large fairness gain.

Research Frontiers

Ongoing work:

  1. Multi-objective optimization: Jointly optimize AUC + fairness + privacy
  2. Federated learning to rank: Beyond binary classification to ranking problems
  3. Adaptive risk levels: Learn optimal risk level (CVaR α) from data
  4. Hierarchical X-risk: Nested risk measures for complex decision hierarchies

Getting Started

pip install octomil

# Initialize with the AUC objective
octomil init fraud-detection \
    --objective auc \
    --imbalance-ratio 0.001

# Train with cyclic participation
octomil train \
    --objective auc \
    --participation cyclic \
    --cohorts 20

See our Advanced FL Strategies guide for detailed tutorials.


References

  1. Guo, Z., Jin, R., Luo, J., & Yang, T. (2023). FeDXL: Provable federated learning for deep X-risk optimization. International Conference on Machine Learning (ICML). arXiv:2210.14396

  2. Vangapally, U., Wu, W., Chen, C., & Guo, Z. (2026). Communication-efficient federated AUC maximization with cyclic client participation. Transactions on Machine Learning Research (TMLR). arXiv:2601.01649

  3. Guo, Z. & Yang, T. (2024). Communication-efficient federated group distributionally robust optimization. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2410.06369

  4. Yuan, Z., Guo, Z., Chawla, N., & Yang, T. (2022). Compositional training for end-to-end deep AUC maximization. International Conference on Learning Representations (ICLR) (Spotlight). OpenReview

  5. Qi, Q., Guo, Z., Xu, Y., Jin, R., & Yang, T. (2021). An online method for deep distributionally robust optimization. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2006.10138

  6. Guo, Z., Xu, Y., Yin, W., Jin, R., & Yang, T. (2021). Unified convergence analysis for adaptive optimization with moving average estimator. Machine Learning Journal. arXiv:2104.14840