Monitoring Dashboard

The Monitoring page (/monitoring) is an operational command center for federated learning reliability. It combines SLO status, infrastructure health, device fleet readiness, training quality, alerting, and incident response in one view.

What You See

The page is organized as stacked panels instead of the older sidebar/section layout:

  • Global Health Score card
  • Reliability Snapshot
  • Fleet Readiness
  • Training Operations
  • Incident Command
  • Regional Health
  • Model Risk and Drift
  • Rollout Risk
  • Alert Rules
  • Readiness by Model and Region

Global Controls

Top-right controls:

  • Time Range picker
  • Auto Refresh toggle
  • Refresh Now button

Time Range

The range picker supports:

  • Past 24h
  • Past 7d
  • Past 14d
  • Past 30d
  • Past 90d
  • Past 1y
  • Custom... (start date to now)

Notes:

  • Monitoring aggregate APIs are day-window based (days / window_days).
  • Custom... is converted to a day-window from the selected start date to now.

Global Health Score

The score is shown as 0-100 with status bands:

  • excellent: >= 90
  • good: >= 75
  • warning: >= 60
  • critical: < 60

The score blends multiple weighted signals:

  • SLO attainment
  • Infrastructure status
  • SDK stability
  • Incident pressure
  • Round completion

Safeguards:

  • Handles missing/invalid slices without forcing misleading values.
  • Avoids optimistic rounding to perfect 100.
  • If SLO data exists, overall score is capped by normalized SLO performance.
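The band mapping and safeguards can be illustrated with a minimal sketch (the signal names and weights here are assumptions, not the product's actual formula):

```python
def status_band(score: float) -> str:
    # Bands from above: excellent >= 90, good >= 75, warning >= 60, else critical.
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "good"
    if score >= 60:
        return "warning"
    return "critical"

def health_score(signals: dict, weights: dict) -> float:
    """Blend 0-100 signals, skipping missing slices instead of zero-filling them."""
    present = {k: v for k, v in signals.items() if v is not None}
    if not present:
        return 0.0
    total_weight = sum(weights[k] for k in present)
    score = sum(weights[k] * present[k] for k in present) / total_weight
    # If SLO data exists, cap the overall score by the normalized SLO signal.
    if signals.get("slo") is not None:
        score = min(score, signals["slo"])
    return round(score, 1)
```

Note how a strong infrastructure signal cannot lift the score above a weak SLO signal, which is why the score reads as conservative.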

KPI Strip

The top KPI strip includes:

  • SLO Attainment
  • Open Incidents
  • Crash Rate
  • Round Completion
  • Data Freshness (last successful aggregate fetch)
  • Time Range
  • Partial Failures (count of failed API calls in latest batch refresh)
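The Partial Failures counter can be understood as counting failed calls in a fan-out refresh, roughly as in this sketch (the fetcher structure is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def batch_refresh(fetchers: dict[str, Callable[[], object]]) -> tuple[dict, int]:
    """Run all panel fetches concurrently; return results plus a partial-failure count."""
    results: dict = {}
    failures = 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, fut in futures.items():
            try:
                results[name] = fut.result(timeout=30)
            except Exception:
                results[name] = None  # panel shows N/A instead of a synthetic value
                failures += 1
    return results, failures
```

A non-zero count means some panels are rendering from an incomplete batch; Refresh Now retries the whole fan-out.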

Panel Reference

Reliability Snapshot

Focuses on platform reliability indicators:

  • Error budget spent
  • Error logs (24h)
  • Database latency
  • Storage type

Fleet Readiness

Shows operational device readiness:

  • Total devices
  • Online in last hour
  • Training eligible percentage
  • SDK error rate

Training Operations

Tracks experiment execution state:

  • Active rounds
  • Failed rounds
  • Active rollouts
  • Models tracked

Incident Command

Triage-first incident table with:

  • Open count
  • MTTR
  • Range label
  • Incident list (title, severity, status, owner, created time)

Regional Health

Top regions by fleet size, with key performance columns:

  • Region
  • Latency
  • Availability

Model Risk and Drift

Most at-risk models by drift score:

  • Model status
  • Accuracy
  • Latency
  • Drift score

Rollout Risk

Active rollout quality check:

  • Model/version
  • Canary vs rollout stage
  • 1-hour error rate

Alert Rules

Policy coverage summary:

  • Rule name
  • Severity
  • Trigger count

Where rule behavior is defined:

  • Settings → Integrations: Configure outbound channels (Slack, Email, Webhook, SIEM/PagerDuty)
  • Settings → Alerts: Create/edit alert rules and channel routing

Smart-fill metric presets in the alert form include:

  • SLO achieved ratio (slo_achieved)
  • DB latency (infra_database_latency_ms)
  • SDK crash/error rates (sdk_crash_rate_pct, sdk_error_rate_pct)
  • Training completion and failed rounds
  • Rollout error rate
  • Incident MTTR and resolution rate

You can also choose Custom metric and provide metric_source + metric_name.

Supported aggregations:

  • latest, count, rate, avg
  • p95, p99, median, min, max, pct

Rules evaluate continuously in the background and can open/resolve incidents as conditions change.
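A single evaluation pass can be sketched as: aggregate the recent samples, then compare against the threshold (a minimal sketch assuming nearest-rank percentiles; the real evaluator also tracks threshold_duration_minutes before opening an incident):

```python
import statistics

OPS = {"gt": lambda v, t: v > t, "lt": lambda v, t: v < t,
       "gte": lambda v, t: v >= t, "lte": lambda v, t: v <= t}

def aggregate(samples: list[float], kind: str) -> float:
    """A few of the supported aggregations; p95/p99 use nearest-rank percentiles."""
    if kind == "avg":
        return statistics.fmean(samples)
    if kind == "latest":
        return samples[-1]
    if kind in ("p95", "p99"):
        pct = int(kind[1:])
        ordered = sorted(samples)
        rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
        return ordered[rank]
    raise ValueError(f"unsupported aggregation: {kind}")

def rule_breached(samples: list[float], rule: dict) -> bool:
    """True when the aggregated value crosses the rule's threshold."""
    value = aggregate(samples, rule["filters"]["aggregation"]["type"])
    return OPS[rule["threshold_operator"]](value, rule["threshold_value"])
```

With the example rule below, a window whose p95 error rate exceeds 5 would breach.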

Example rule shape used by the API:

{
  "name": "Rollout error rate high",
  "alert_type": "threshold",
  "severity": "error",
  "metric_name": "error_rate_pct",
  "metric_source": "rollout",
  "threshold_operator": "gt",
  "threshold_value": 5,
  "threshold_duration_minutes": 10,
  "filters": {
    "aggregation": {
      "type": "p95",
      "window_minutes": 10
    }
  },
  "notify_channels": ["integration-id-slack", "integration-id-email"]
}

Readiness by Model and Region

Side-by-side confidence signals:

  • Model readiness table
  • Geo distribution table

Data Inputs

Monitoring consolidates reliability, fleet, training, alert, and incident signals into one client-facing view so operators can make rollout and response decisions quickly.

Production policy:

  • The dashboard must show live telemetry only.
  • If a metric is unavailable, show N/A or unknown with context instead of synthetic placeholders.
  • Regional latency is displayed with source-aware labeling so customers can distinguish measured vs estimated values.

Troubleshooting

Health score looks wrong

Check these first:

  • SLO value and target
  • Incident open count and severity mix
  • SDK crash rate
  • Round completion rate

The score is intentionally conservative and may be lower than a single strong metric.

Data freshness is old

  • Use Refresh Now.
  • Confirm backend services are reachable.
  • Check Partial Failures; non-zero means some APIs failed in the latest refresh cycle.

Empty panels in narrow ranges

For short ranges (for example 24h), some panels can legitimately be empty if there was no recent activity.

Gotchas

  • Health score is intentionally conservative — a single weak signal (high incident count, low SLO) can drag the overall score below what individual metrics suggest. This is by design.
  • Time range affects all panels — changing the time range reloads every panel. For large fleets, the 90-day and 1-year ranges may take several seconds to load.
  • Alert rules evaluate continuously — rules run in the background even when the dashboard is closed. Alerts fire based on metric conditions, not page visits.
  • Regional latency may be estimated — if measured data isn't available for a region, the dashboard shows an estimate with a label. Don't treat estimated values as SLO baselines.
  • Partial failures are transient — a non-zero partial failure count usually means one API timed out during the refresh cycle. Hit Refresh Now to retry. Persistent failures indicate a backend issue.
  • Model drift scores update daily — drift detection runs on a daily cadence, not real-time. A newly deployed model won't show drift scores until the next daily evaluation.