Monitoring Dashboard
The Monitoring page (/monitoring) is an operational command center for federated learning reliability. It combines SLO status, infrastructure health, device fleet readiness, training quality, alerting, and incident response in one view.
What You See
The page is organized as stacked panels instead of the older sidebar/section layout:
- Global Health Score card
- Reliability Snapshot
- Fleet Readiness
- Training Operations
- Incident Command
- Regional Health
- Model Risk and Drift
- Rollout Risk
- Alert Rules
- Readiness by Model and Region
Global Controls
Top-right controls:
- Time Range picker
- Auto Refresh toggle
- Refresh Now button
Time Range
The range picker supports:
- Past 24h
- Past 7d
- Past 14d
- Past 30d
- Past 90d
- Past 1y
- Custom... (start date to now)
Notes:
- Monitoring aggregate APIs are day-window based (days/window_days). Custom... is converted to a day-window from the selected start date to now.
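The conversion from a custom start date to a day-window can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the function name custom_range_to_days is hypothetical, and rounding up to a whole day (with a minimum of one day) is an assumption made so that the window always covers the selected start date.

```python
import math
from datetime import datetime, timezone
from typing import Optional


def custom_range_to_days(start: datetime, now: Optional[datetime] = None) -> int:
    """Convert a custom start date into a whole-day window that ends now."""
    now = now or datetime.now(timezone.utc)
    if start > now:
        raise ValueError("start date must not be in the future")
    elapsed_days = (now - start).total_seconds() / 86400
    # Assumption: round up so the window fully covers the start date,
    # and never send a window shorter than one day.
    return max(1, math.ceil(elapsed_days))
```

The returned integer would then be passed as the days or window_days value on the aggregate requests.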
Global Health Score
The score is shown as 0-100 with status bands:
- excellent: >= 90
- good: >= 75
- warning: >= 60
- critical: < 60
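The band thresholds above map onto a simple lookup. The sketch below uses only the documented cutoffs; the function name status_band is illustrative.

```python
def status_band(score: float) -> str:
    """Map a 0-100 health score to its documented status band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "good"
    if score >= 60:
        return "warning"
    return "critical"
```

Each lower bound is inclusive, so a score of exactly 75 is good and 74.9 is warning.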
The score blends multiple weighted signals:
- SLO attainment
- Infrastructure status
- SDK stability
- Incident pressure
- Round completion
Safeguards:
- Handles missing/invalid slices without forcing misleading values.
- Avoids optimistic rounding to perfect 100.
- If SLO data exists, overall score is capped by normalized SLO performance.
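One way the blend and its safeguards could fit together is sketched below. The weights, signal keys, and one-decimal truncation are all assumptions for illustration; the dashboard's real weighting and normalization are not documented here. Each signal is assumed to be pre-normalized to a 0-100 scale.

```python
import math
from typing import Mapping, Optional

# Hypothetical integer weights (they sum to 100); not the dashboard's real values.
WEIGHTS = {"slo": 35, "infra": 20, "sdk": 15, "incidents": 15, "rounds": 15}


def _usable(value: object) -> bool:
    """True only for a finite number within the 0-100 range."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
        and 0 <= value <= 100
    )


def health_score(signals: Mapping[str, Optional[float]]) -> Optional[float]:
    """Blend the available signals; return None when nothing usable exists."""
    # Safeguard 1: skip missing or invalid slices instead of substituting a value.
    valid = {k: float(signals[k]) for k in WEIGHTS if _usable(signals.get(k))}
    if not valid:
        return None
    # Re-normalize over the weights of the slices that are actually present.
    total_weight = sum(WEIGHTS[k] for k in valid)
    score = sum(WEIGHTS[k] * v for k, v in valid.items()) / total_weight
    # Safeguard 2: when SLO data exists, it caps the overall score.
    if "slo" in valid:
        score = min(score, valid["slo"])
    # Safeguard 3: truncate (never round up) so 99.96 shows as 99.9, not 100.
    return math.floor(score * 10) / 10
```

With these assumed weights, health_score({"slo": 50, "infra": 100, "sdk": 100, "incidents": 100, "rounds": 100}) returns 50.0, because the SLO cap overrides the otherwise healthy blend.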
KPI Strip
The top KPI strip includes:
- SLO Attainment
- Open Incidents
- Crash Rate
- Round Completion
- Data Freshness (last successful aggregate fetch)
- Time Range
- Partial Failures (count of failed API calls in latest batch refresh)
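The relationship between a batch refresh, the Partial Failures count, and Data Freshness can be sketched as below. The names RefreshResult and refresh_batch are hypothetical, and so is the rule that freshness advances whenever at least one call in the batch succeeds; the dashboard may define a successful aggregate fetch more strictly.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional


@dataclass
class RefreshResult:
    data: Dict[str, Any]          # per-panel payloads; None where the call failed
    partial_failures: int         # failed API calls in this batch
    fresh_at: Optional[datetime]  # time of the last refresh that returned data


def refresh_batch(
    fetchers: Dict[str, Callable[[], Any]],
    previous_fresh_at: Optional[datetime] = None,
) -> RefreshResult:
    """Run every fetcher once; count failures instead of aborting the batch."""
    data: Dict[str, Any] = {}
    failures = 0
    for name, fetch in fetchers.items():
        try:
            data[name] = fetch()
        except Exception:
            # One failing API must not blank the other panels.
            data[name] = None
            failures += 1
    succeeded = len(fetchers) - failures
    # Assumption: freshness only moves forward when something succeeded.
    fresh_at = datetime.now(timezone.utc) if succeeded else previous_fresh_at
    return RefreshResult(data, failures, fresh_at)
```

Under this sketch, a panel whose payload is None keeps its previous content or shows an unavailable state, while the KPI strip reports the failure count for the batch.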
Panel Reference
Reliability Snapshot
Focuses on platform reliability indicators:
- Error budget spent
- Error logs (24h)
- Database latency
- Storage type
Fleet Readiness
Shows operational device readiness:
- Total devices
- Online in last hour
- Training eligible percentage
- SDK error rate
Training Operations
Tracks experiment execution state:
- Active rounds
- Failed rounds
- Active rollouts
- Models tracked
Incident Command
Triage-first incident table with:
- Open count
- MTTR
- Range label
- Incident list (title, severity, status, owner, created time)
Regional Health
Top regions by fleet size and key performance:
- Region
- Latency
- Availability
Model Risk and Drift
Most at-risk models by drift score:
- Model status
- Accuracy
- Latency
- Drift score
Rollout Risk
Active rollout quality check:
- Model/version
- Canary vs rollout stage
- 1-hour error rate
Alert Rules
Policy coverage summary:
- Rule name
- Severity
- Trigger count
Where rule behavior is defined:
- Settings → Integrations: Configure outbound channels (Slack, Email, Webhook, SIEM/PagerDuty)
- Settings → Alerts: Create/edit alert rules and channel routing
Smart-fill metric presets in the alert form include:
- SLO achieved ratio (slo_achieved)
- DB latency (infra_database_latency_ms)
- SDK crash/error rates (sdk_crash_rate_pct, sdk_error_rate_pct)
- Training completion and failed rounds
- Rollout error rate
- Incident MTTR and resolution rate
You can also choose Custom metric and provide metric_source + metric_name.
Supported aggregations:
latest, count, rate, avg, p95, p99, median, min, max, pct
Rules evaluate continuously in the background and can open/resolve incidents as conditions change.
Example rule shape used by the API:
{
  "name": "Rollout error rate high",
  "alert_type": "threshold",
  "severity": "error",
  "metric_name": "error_rate_pct",
  "metric_source": "rollout",
  "threshold_operator": "gt",
  "threshold_value": 5,
  "threshold_duration_minutes": 10,
  "filters": {
    "aggregation": {
      "type": "p95",
      "window_minutes": 10
    }
  },
  "notify_channels": ["integration-id-slack", "integration-id-email"]
}
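Before submitting a rule like the one above, a client-side check can catch obvious mistakes. The sketch below is illustrative only: validate_rule is a hypothetical helper, and the set of required fields is an assumption inferred from the example, not a documented API contract. The aggregation names come from the supported list on this page.

```python
from typing import Any, Dict, List

# Aggregations listed as supported on this page.
SUPPORTED_AGGREGATIONS = {
    "latest", "count", "rate", "avg", "p95", "p99", "median", "min", "max", "pct",
}

# Assumption: fields treated as required here are inferred from the example rule.
REQUIRED_FIELDS = (
    "name", "severity", "metric_name", "metric_source",
    "threshold_operator", "threshold_value",
)


def validate_rule(rule: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the rule looks well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in rule]
    aggregation = rule.get("filters", {}).get("aggregation")
    if aggregation is not None:
        agg_type = aggregation.get("type")
        if agg_type not in SUPPORTED_AGGREGATIONS:
            problems.append(f"unsupported aggregation: {agg_type}")
    if "notify_channels" in rule and not isinstance(rule["notify_channels"], list):
        problems.append("notify_channels must be a list of integration IDs")
    return problems
```

A rule that uses "p50" as its aggregation type would be reported as unsupported, since the list offers median instead.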
Readiness by Model and Region
Side-by-side confidence signals:
- Model readiness table
- Geo distribution table
Data Inputs
Monitoring consolidates reliability, fleet, training, alert, and incident signals into one client-facing view so operators can make rollout and response decisions quickly.
Production policy:
- The dashboard must show live telemetry only.
- If a metric is unavailable, show N/A or unknown with context instead of synthetic placeholders.
- Regional latency is displayed with source-aware labeling so customers can distinguish measured vs estimated values.
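A display helper that follows this policy might look like the sketch below. The function names, the one-decimal formatting, and the exact label text are assumptions; only the N/A fallback and the measured/estimated distinction come from the policy above.

```python
import math
from typing import Optional


def format_metric(value: Optional[float], unit: str = "", reason: str = "") -> str:
    """Render a metric, or N/A with context when no live value exists."""
    if value is None or math.isnan(value):
        # Never substitute a synthetic number for missing telemetry.
        return f"N/A ({reason})" if reason else "N/A"
    return f"{value:.1f}{unit}"


def format_regional_latency(latency_ms: Optional[float], source: str) -> str:
    """Render regional latency with a label for where the value came from."""
    rendered = format_metric(latency_ms, " ms", "no latency data")
    if rendered.startswith("N/A"):
        return rendered
    # Anything other than the two known sources is labeled unknown.
    label = source if source in ("measured", "estimated") else "unknown"
    return f"{rendered} ({label})"
```

For example, format_regional_latency(42, "estimated") yields "42.0 ms (estimated)", which keeps an estimate visibly distinct from a measured value.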
Troubleshooting
Health score looks wrong
Check these first:
- SLO value and target
- Incident open count and severity mix
- SDK crash rate
- Round completion rate
The score is intentionally conservative and may be lower than a single strong metric.
Data freshness is old
- Use Refresh Now.
- Confirm backend services are reachable.
- Check Partial Failures; non-zero means some APIs failed in the latest refresh cycle.
Empty panels in narrow ranges
For short ranges (for example 24h), some panels can legitimately be empty if there was no recent activity.
Gotchas
- Health score is intentionally conservative — a single weak signal (high incident count, low SLO) can drag the overall score below what individual metrics suggest. This is by design.
- Time range affects all panels — changing the time range reloads every panel. For large fleets, the 90-day and 1-year ranges may take several seconds to load.
- Alert rules evaluate continuously — rules run in the background even when the dashboard is closed. Alerts fire based on metric conditions, not page visits.
- Regional latency may be estimated — if measured data isn't available for a region, the dashboard shows an estimate with a label. Don't treat estimated values as SLO baselines.
- Partial failures are transient — a non-zero partial failure count usually means one API timed out during the refresh cycle. Hit Refresh Now to retry. Persistent failures indicate a backend issue.
- Model drift scores update daily — drift detection runs on a daily cadence, not real-time. A newly deployed model won't show drift scores until the next daily evaluation.
Related
- Observability — inference and training telemetry
- Audit Logs — event-level audit trail
- Workspace Settings — metrics export, alert routing
- Rollouts — rollout risk panel details
- Experiments — training operations panel details