Beyond Cross-Entropy: Federated AUC Maximization and X-Risk Optimization
The default objective in machine learning is minimizing cross-entropy loss. It's the standard choice in PyTorch (nn.CrossEntropyLoss). It's what most federated learning papers optimize. It's simple, well understood, and works well for balanced classification problems.
But real-world applications rarely have balanced classes or standard objectives.
- Medical diagnosis: 1% of patients have disease → 99% accuracy means predicting "no disease" for everyone
- Fraud detection: 0.1% of transactions are fraudulent → standard loss fails
- Financial risk: We care about tail risk (the 1% worst-case scenarios), not average loss
- Imbalanced federated data: Some devices have only positive examples, others only negatives
This post explores specialized optimization objectives for federated learning and how Octomil supports them through recent breakthroughs from Guo, Yang, and collaborators.
The Problem with Cross-Entropy
When Standard Loss Fails
Example: Cancer detection from medical images
Dataset:
- 10,000 patients total
- 100 have cancer (1%)
- 9,900 are healthy (99%)
Model trained with cross-entropy:
# Standard training
model = train_classifier(data, loss="cross_entropy")
predictions = model.predict(test_data)
# Results:
# Accuracy: 99.2% ✓ (looks great!)
# But actually: Model predicts "no cancer" for everyone
# True positive rate: 0% ✗ (catastrophic failure)
Why it fails: Cross-entropy optimizes average accuracy. With 99% negatives, the optimal solution is to always predict negative.
What We Actually Care About
For imbalanced problems, we need metrics that focus on ranking quality:
AUC (Area Under the ROC Curve):
- Measures how well the model ranks positives above negatives
- AUC = 1.0: Perfect ranking (all positives scored higher than negatives)
- AUC = 0.5: Random ranking
- Key property: Invariant to class imbalance
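The ranking interpretation is easy to make concrete. Here is a minimal sketch in plain Python (function name and toy data are illustrative) that computes AUC as the fraction of positive-negative pairs the model ranks correctly:

```python
def pairwise_auc(scores, labels):
    """AUC = P(a random positive is scored above a random negative).
    Ties count as half a correctly ranked pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return correct / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0]
print(pairwise_auc(scores, labels))  # 1.0: every positive outranks every negative
```

A model that assigns every example the same score ("predict negative for everyone") gets AUC 0.5 regardless of how high its accuracy is, which is exactly why AUC exposes the failure mode that accuracy hides.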
X-Risk (Tail risk measures):
- Focus on worst-case scenarios (e.g., 95th percentile loss)
- Critical for safety applications, financial risk management
- Standard objective: expected (average) loss. X-risk alternative: Conditional Value at Risk (CVaR), the average loss over the worst-case tail
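Empirically, CVaR at level α is just the mean of the worst (1 − α) fraction of losses. A minimal sketch (plain Python; the loss values are made up for illustration):

```python
import math

def cvar(losses, alpha=0.9):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses."""
    ordered = sorted(losses, reverse=True)              # worst losses first
    k = max(1, math.ceil((1 - alpha) * len(ordered)))   # size of the tail
    return sum(ordered[:k]) / k

losses = [1.0] * 9 + [10.0]        # one catastrophic outcome in ten
print(sum(losses) / len(losses))   # expected loss: 1.9
print(cvar(losses, alpha=0.9))     # CVaR(90%): 10.0 -- the tail dominates
```

The expected loss barely registers the catastrophic sample; CVaR is defined by it. That is the property safety-critical applications need.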
Federated AUC Maximization
The Challenge
Naive approach: Each device optimizes local AUC, server averages.
Problem: AUC is defined over all pairs of positive and negative examples, and in federated learning those pairs can span devices:
- Device A has only positive examples
- Device B has only negative examples
- Cannot compute local AUC (need examples from both classes)
FeDXL: Deep X-Risk Optimization (Guo et al., ICML 2023)
FeDXL provides the first provable federated learning method for deep X-risk optimization, including AUC maximization.
Key insight: Reformulate AUC as a saddle-point problem that can be decomposed across devices.
AUC reformulation: for the squared pairwise surrogate, the objective min_w E[(1 − h_w(x) + h_w(x'))^2] over positive/negative pairs (x, x') can be rewritten as a saddle-point problem min_{w,a,b} max_α E[F(w, a, b, α; x, y)] whose expectation runs over single examples rather than pairs (a rewrite introduced by Ying et al., NeurIPS 2016, and generalized by FeDXL).
This can be optimized locally even when a device holds only one class!
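FeDXL covers a general family of X-risks, but the flavor of the decomposition already shows up in the classic min-max rewrite of the squared pairwise AUC surrogate (Ying et al., NeurIPS 2016). A sketch of the per-example term, assuming the positive rate p is known; the auxiliary variables (a, b, alpha) follow that paper's notation, and the function names are illustrative:

```python
def auc_saddle_term(score, y, a, b, alpha, p):
    """Per-example term of the min-max (saddle-point) AUC surrogate.
    score: model output h_w(x); y: label in {0, 1}; a, b, alpha:
    auxiliary variables; p: positive-class prior. Each term touches
    ONE example, so a device holding a single class can still
    compute stochastic gradients for its share of the expectation."""
    if y == 1:
        return (1 - p) * (score - a) ** 2 - 2 * (1 + alpha) * (1 - p) * score
    else:
        return p * (score - b) ** 2 + 2 * (1 + alpha) * p * score

def batch_objective(scores, labels, a, b, alpha, p):
    """Empirical saddle objective; the -p(1-p)*alpha^2 term, which is
    constant across examples, is added once."""
    n = len(scores)
    total = sum(auc_saddle_term(s, y, a, b, alpha, p)
                for s, y in zip(scores, labels))
    return total / n - p * (1 - p) * alpha ** 2
```

Because every term depends on a single (x, y), a device holding only positives (or only negatives) contributes unbiased gradients without ever seeing the other class.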
Provable guarantees:
- Convergence rate: O(1/√T) for non-convex deep networks
- Communication complexity: Same as FedAvg (no overhead for AUC objective)
- Works with heterogeneous data (different class distributions per device)
import octomil
# Federated AUC maximization in Octomil
client = octomil.OctomilClient(
project_id="cancer-detection-fl",
objective="auc", # AUC maximization instead of cross-entropy
imbalance_ratio=0.01 # 1% positive class
)
client.train(
model=medical_image_model,
rounds=100
)
# Octomil handles:
# - Local AUC optimization (even with single class per device)
# - Global AUC aggregation
# - Maintains provable convergence (FeDXL algorithm)
Communication-Efficient Federated AUC (Guo et al., TMLR 2026)
Problem: Standard federated AUC requires communicating model updates every round.
Communication-Efficient Federated AUC Maximization with Cyclic Client Participation achieves constant communication complexity per device:
Key technique: Cyclic participation
- Divide devices into K cohorts
- Each cohort participates once every K rounds
- Each device's total communication: T/K updates over T rounds (a K× reduction)
Result: Train federated AUC model with 100 rounds but each device only communicates 5 times (20× reduction).
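The schedule itself is simple: round r is served by cohort r mod K. A sketch (illustrative only, not Octomil's internal implementation):

```python
def cohort_for_round(round_idx, num_cohorts):
    """Cohort that participates in a given round (cyclic schedule)."""
    return round_idx % num_cohorts

def rounds_for_device(device_cohort, total_rounds, num_cohorts):
    """Rounds in which a given device's cohort communicates."""
    return [r for r in range(total_rounds)
            if cohort_for_round(r, num_cohorts) == device_cohort]

# 100 rounds, 20 cohorts: each device communicates in exactly 5 rounds.
participation = rounds_for_device(device_cohort=3, total_rounds=100,
                                  num_cohorts=20)
print(len(participation))  # 5
print(participation)       # [3, 23, 43, 63, 83]
```

The analytical contribution of the paper is not the schedule but the proof that AUC convergence survives it despite each cohort training on a stale global model between its turns.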
# Communication-efficient federated AUC
client = octomil.OctomilClient(
project_id="fraud-detection-fl",
objective="auc",
participation_strategy="cyclic", # Cyclic cohorts
cohort_count=20 # Each device participates 1/20 rounds
)
# Total rounds: 100
# Communication per device: 5 updates (20× reduction)
# AUC performance: Same as full participation
Distributionally Robust Optimization (DRO)
The Distribution Shift Problem
Training distribution: data from 2020-2023
Deployment distribution: data from 2024 (COVID aftermath, economic changes)
Standard ML optimizes for average training performance, which fails under distribution shift.
Distributionally Robust Optimization optimizes for worst-case performance over a set of plausible distributions.
Communication-Efficient Federated Group DRO (Guo & Yang, NeurIPS 2024)
Federated Group Distributionally Robust Optimization addresses the hardest federated setting:
Challenges:
- Data heterogeneity: Each device has different distribution
- Group fairness: Ensure good performance for all demographic groups
- Communication efficiency: DRO typically requires expensive second-order methods
Key contributions:
- First communication-efficient algorithm for federated group DRO
- Handles non-convex objectives (deep networks)
- Provable convergence with O(ε^-4) communication complexity (optimal)
Group DRO objective: min_w max_{g ∈ G} L_g(w)
where L_g(w) is the loss on group g (e.g., a demographic group or device cluster).
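The outer max is typically handled by reweighting groups toward the worst performer. A sketch of the exponentiated-gradient weight update commonly used for Group DRO (step size and loss values are illustrative, and this is a simplification of the communication-efficient algorithm):

```python
import math

def update_group_weights(weights, group_losses, step=1.0):
    """Exponentiated-gradient ascent on group weights: groups with
    higher loss receive more weight, pushing the model to improve
    its worst-performing group."""
    raw = [w * math.exp(step * g) for w, g in zip(weights, group_losses)]
    total = sum(raw)
    return [r / total for r in raw]

weights = [1 / 3, 1 / 3, 1 / 3]
group_losses = [0.2, 0.3, 0.9]          # group 3 is doing worst
for _ in range(5):
    weights = update_group_weights(weights, group_losses)
print(max(range(3), key=lambda i: weights[i]))  # 2: worst group dominates
```

The model's weighted loss is then minimized under these weights, so training effort concentrates where performance is weakest.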
# Federated Group DRO in Octomil
client = octomil.OctomilClient(
project_id="fair-lending-model",
objective="group-dro",
# Define groups (can be demographic, geographic, etc.)
groups={
"group_1": device_ids_1, # e.g., young borrowers
"group_2": device_ids_2, # e.g., older borrowers
"group_3": device_ids_3 # e.g., small business
},
# Optimize for worst-group performance
robustness_parameter=0.1 # CVaR at 90% level
)
client.train(
model=credit_risk_model,
rounds=50
)
# Result: Model performs well even on worst-performing group
X-Risk Optimization for Safety-Critical Applications
Compositional Training for AUC (Yuan et al., ICLR 2022)
Problem: Deep AUC maximization is unstable with standard optimizers.
Compositional Training for End-to-End Deep AUC Maximization (with Guo as a co-author) provides stable training:
Key technique: Compositional optimization
- Decompose AUC objective into inner and outer optimization problems
- Inner problem: Update auxiliary variables
- Outer problem: Update model weights
Result: Stable training for deep networks with AUC objective.
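The inner/outer split can be sketched on a toy compositional objective h(g(w)): an auxiliary moving average u tracks the inner value g(w), and the weights take a gradient step through u via the chain rule. All names, step sizes, and the toy functions below are illustrative, not the paper's algorithm:

```python
def compositional_step(w, u, h_grad, g_value, g_grad, lr_w=0.1, lr_u=0.5):
    """One step of two-timescale compositional optimization:
    inner: fast moving-average estimate u of the inner value g(w);
    outer: slow gradient step on w through h'(u) * g'(w)."""
    u = (1 - lr_u) * u + lr_u * g_value(w)   # inner update (auxiliary variable)
    w = w - lr_w * h_grad(u) * g_grad(w)     # outer update (model weights)
    return w, u

# Toy problem: minimize h(g(w)) with g(w) = w**2 and h(u) = (u - 1)**2.
# Minima sit at w = +/-1, where g(w) = 1.
g = lambda w: w * w
g_grad = lambda w: 2 * w
h_grad = lambda u: 2 * (u - 1)

w, u = 2.0, 0.0
for _ in range(200):
    w, u = compositional_step(w, u, h_grad, g, g_grad)
print(round(w, 3))  # 1.0
```

Tracking the inner value with a moving average instead of a single noisy sample is what keeps the outer gradient estimate stable, and the same idea underlies stable deep AUC training.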
# Compositional AUC training
client = octomil.OctomilClient(
project_id="medical-diagnosis",
objective="auc",
training_method="compositional", # Stable compositional training
# AUC-specific hyperparameters
auc_margin=1.0, # Margin for AUC loss
adaptive_margin=True # Adapt margin during training
)
Deep Distributionally Robust Optimization (Guo et al., NeurIPS 2021)
An Online Method for Deep Distributionally Robust Optimization enables real-time adaptation to distribution shift:
Setting: Online federated learning where data distribution changes over time
Approach:
- Maintain uncertainty set around current distribution
- Optimize for worst-case within uncertainty set
- Update uncertainty set as new data arrives
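The uncertainty set here is a Wasserstein ball; for a self-contained illustration it is easier to use a KL-divergence uncertainty set, whose worst-case distribution has a closed form that exponentially upweights high-loss samples. A sketch (the temperature plays the role of the ball radius; all values are illustrative):

```python
import math

def robust_weights(losses, temperature=0.5):
    """Worst-case sample weights under a KL-divergence uncertainty set:
    the adversarial distribution upweights high-loss samples, with the
    temperature controlling the effective set radius."""
    raw = [math.exp(l / temperature) for l in losses]
    z = sum(raw)
    return [r / z for r in raw]

def robust_loss(losses, temperature=0.5):
    """Worst-case expected loss under the KL-ball adversary."""
    w = robust_weights(losses, temperature)
    return sum(wi * li for wi, li in zip(w, losses))

losses = [0.1, 0.1, 0.1, 2.0]
print(sum(losses) / len(losses))  # average loss, roughly 0.575
print(robust_loss(losses))        # close to 2.0, the worst-case sample
```

Optimizing the robust loss instead of the average forces the model to pay attention to the samples (or time periods) where it is currently failing, which is what makes it resilient as the distribution drifts.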
# Online DRO for time-varying distributions
client = octomil.OctomilClient(
project_id="stock-prediction-fl",
objective="dro",
# Online setting
online=True,
uncertainty_radius=0.1, # Wasserstein ball radius
adaptation_rate=0.01 # How quickly to adapt to new distributions
)
# As data distribution changes, model adapts robustly
for round_data in streaming_data:
client.train_online(round_data)
Adaptive Optimization for Non-Standard Objectives
Unified Convergence Analysis for Adaptive Optimization (Guo et al., Machine Learning Journal)
Problem: Adam, AdaGrad, and other adaptive optimizers lack convergence guarantees for AUC and X-risk objectives.
Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator provides the first rigorous analysis.
Key contributions:
- Convergence guarantees for Adam-family optimizers with AUC/X-risk objectives
- Unified framework covering SGD, Adam, AMSGrad, etc.
- Practical guidelines for learning rate schedules
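For reference, the optimizer family the analysis covers is ordinary Adam with its two moving-average estimators. A minimal sketch of one update (the toy quadratic at the end is only a demo; the analysis applies these updates to AUC/X-risk surrogates):

```python
def adam_step(w, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moving-average estimators
    of the first and second moments of the gradient."""
    m = b1 * m + (1 - b1) * g          # first-moment moving average
    v = b2 * v + (1 - b2) * g * g      # second-moment moving average
    m_hat = m / (1 - b1 ** t)          # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Toy run: minimize f(w) = w**2 (gradient 2w) starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(abs(w) < 0.2)  # True: w is driven into a small neighborhood of 0
```

The subtlety the paper resolves is that for AUC and X-risk objectives the gradient itself is estimated through another moving average, so the two estimators interact; the unified analysis shows the combination still converges.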
# Adaptive optimization for AUC
client = octomil.OctomilClient(
project_id="auc-optimization",
objective="auc",
# Adaptive optimizer with guarantees
optimizer="adam",
learning_rate="adaptive", # Auto-tune per Guo et al. theory
# Moving average parameters
beta1=0.9, # First moment
beta2=0.999 # Second moment (standard Adam defaults)
)
When to Use Each Objective
| Objective | Use When | Example Applications |
|---|---|---|
| Cross-Entropy | Balanced classes, standard classification | Image classification, sentiment analysis |
| AUC | Imbalanced classes, ranking quality matters | Medical diagnosis, fraud detection, anomaly detection |
| Group DRO | Fairness across groups, distribution shift | Fair lending, demographic parity, multi-region deployment |
| CVaR (X-Risk) | Safety-critical, tail risk matters | Financial risk, autonomous vehicles, medical decisions |
| Compositional AUC | Deep networks + imbalanced data | Deep medical imaging, large-scale anomaly detection |
Octomil's Specialized Objectives Framework
import octomil
# Unified API for specialized objectives
client = octomil.OctomilClient(
project_id="specialized-objective",
# Choose objective
objective="auc", # or "group-dro", "cvar", "compositional-auc"
# Objective-specific configuration
objective_config={
# For AUC
"margin": 1.0,
"imbalance_ratio": 0.01,
# For Group DRO
"groups": group_definitions,
"robustness_level": 0.9, # 90% CVaR
# For X-Risk
"risk_level": 0.95, # 95th percentile
"risk_measure": "cvar" # or "var", "entropic"
},
# Communication efficiency
participation_strategy="cyclic", # For constant communication
# Optimizer
optimizer="adam",
adaptive_optimization=True # Use Guo et al.'s adaptive theory
)
# Train with specialized objective
client.train(
model=my_model,
rounds=100
)
# Evaluate with appropriate metrics
metrics = client.evaluate(
test_data=test_data,
metrics=["auc", "accuracy", "worst_group_loss", "cvar_95"]
)
print(f"AUC: {metrics.auc}")
print(f"Worst group performance: {metrics.worst_group_loss}")
print(f"CVaR (95%): {metrics.cvar_95}")
Real-World Impact
Case Study 1: Medical Imaging (Cancer Detection)
Challenge: 0.5% positive rate (highly imbalanced)
Solution: Federated AUC maximization
- 50 hospitals, each with 10K images
- Standard FL (cross-entropy): 99.5% accuracy, 0% sensitivity (useless)
- FeDXL (AUC): 99.3% accuracy, 87% sensitivity (clinically useful)
Key insight: AUC optimization found cancer cases despite extreme imbalance.
Case Study 2: Financial Fraud Detection
Challenge: 0.1% fraud rate, distribution shift over time
Solution: Online DRO with cyclic participation
- 1M devices (mobile banking apps)
- Cyclic participation: Each device communicates 5 times (vs. 100 in standard FL)
- DRO: Robust to distribution shift (new fraud patterns)
- Result: 95% fraud detection rate with 20× communication reduction
Case Study 3: Fair Credit Scoring
Challenge: Ensure fairness across demographic groups
Solution: Federated Group DRO
- 3 demographic groups with different default rates
- Standard FL: 85% average accuracy, 65% worst-group accuracy (unfair)
- Group DRO: 82% average accuracy, 79% worst-group accuracy (much fairer)
Key insight: Small average accuracy drop for large fairness gain.
Research Frontiers
Ongoing work:
- Multi-objective optimization: Jointly optimize AUC + fairness + privacy
- Federated learning to rank: Beyond binary classification to ranking problems
- Adaptive risk levels: Learn optimal risk level (CVaR α) from data
- Hierarchical X-risk: Nested risk measures for complex decision hierarchies
Getting Started
pip install octomil
# Initialize with AUC objective
octomil init fraud-detection \
--objective auc \
--imbalance-ratio 0.001
# Train with cyclic participation
octomil train \
--objective auc \
--participation cyclic \
--cohorts 20
See our Advanced FL Strategies guide for detailed tutorials.