Monitoring Dashboard

The Monitoring page (/monitoring) is an operational command center for federated learning reliability. It combines SLO status, infrastructure health, device fleet readiness, training quality, alerting, and incident response in one view.

What You See

The page is organized as stacked panels instead of the older sidebar/section layout:

  • Global Health Score card
  • Reliability Snapshot
  • Fleet Readiness
  • Training Operations
  • Incident Command
  • Regional Health
  • Model Risk and Drift
  • Rollout Risk
  • Alert Rules
  • Readiness by Model and Region

Global Controls

Top-right controls:

  • Time Range picker
  • Auto Refresh toggle
  • Refresh Now button

Time Range

The range picker supports:

  • Past 24h
  • Past 7d
  • Past 14d
  • Past 30d
  • Past 90d
  • Past 1y
  • Custom... (start date to now)

Notes:

  • Monitoring aggregate APIs are day-window based (days / window_days).
  • Custom... is converted to a day-window from the selected start date to now.

Global Health Score

The score is shown as 0-100 with status bands:

  • excellent: >= 90
  • good: >= 75
  • warning: >= 60
  • critical: < 60

The score blends multiple weighted signals:

  • SLO attainment
  • Infrastructure status
  • SDK stability
  • Incident pressure
  • Round completion

Safeguards:

  • Handles missing/invalid slices without forcing misleading values.
  • Avoids optimistic rounding to perfect 100.
  • If SLO data exists, overall score is capped by normalized SLO performance.
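The band mapping and safeguards can be illustrated with a minimal sketch (the signal names and weights here are assumptions, not the product's actual formula):

```python
def status_band(score: float) -> str:
    # Bands from above: excellent >= 90, good >= 75, warning >= 60, else critical.
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "good"
    if score >= 60:
        return "warning"
    return "critical"

def health_score(signals: dict, weights: dict) -> float:
    """Blend 0-100 signals, skipping missing slices instead of zero-filling them."""
    present = {k: v for k, v in signals.items() if v is not None}
    if not present:
        return 0.0
    total_weight = sum(weights[k] for k in present)
    score = sum(weights[k] * present[k] for k in present) / total_weight
    # If SLO data exists, cap the overall score by the normalized SLO signal.
    if signals.get("slo") is not None:
        score = min(score, signals["slo"])
    return round(score, 1)
```

Note how a strong infrastructure signal cannot lift the score above a weak SLO signal, which is why the score reads as conservative.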

KPI Strip

The top KPI strip includes:

  • SLO Attainment
  • Open Incidents
  • Crash Rate
  • Round Completion
  • Data Freshness (last successful aggregate fetch)
  • Time Range
  • Partial Failures (count of failed API calls in latest batch refresh)
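The Partial Failures counter can be understood as counting failed calls in a fan-out refresh, roughly as in this sketch (the fetcher structure is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def batch_refresh(fetchers: dict[str, Callable[[], object]]) -> tuple[dict, int]:
    """Run all panel fetches concurrently; return results plus a partial-failure count."""
    results: dict = {}
    failures = 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, fut in futures.items():
            try:
                results[name] = fut.result(timeout=30)
            except Exception:
                results[name] = None  # panel shows N/A instead of a synthetic value
                failures += 1
    return results, failures
```

A non-zero count means some panels are rendering from an incomplete batch; Refresh Now retries the whole fan-out.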

Panel Reference

Reliability Snapshot

Focuses on platform reliability indicators:

  • Error budget spent
  • Error logs (24h)
  • Database latency
  • Storage type

Fleet Readiness

Shows operational device readiness:

  • Total devices
  • Online in last hour
  • Training eligible percentage
  • SDK error rate

Training Operations

Tracks experiment execution state:

  • Active rounds
  • Failed rounds
  • Active rollouts
  • Models tracked

Incident Command

Triage-first incident table with:

  • Open count
  • MTTR
  • Range label
  • Incident list (title, severity, status, owner, created time)

Regional Health

Top regions by fleet size, with key performance columns:

  • Region
  • Latency
  • Availability

Model Risk and Drift

Most at-risk models by drift score:

  • Model status
  • Accuracy
  • Latency
  • Drift score

Rollout Risk

Active rollout quality check:

  • Model/version
  • Canary vs rollout stage
  • 1-hour error rate

Alert Rules

Policy coverage summary:

  • Rule name
  • Severity
  • Trigger count

Where rule behavior is defined:

  • Settings → Integrations: Configure outbound channels (Slack, Email, Webhook, SIEM/PagerDuty)
  • Settings → Alerts: Create/edit alert rules and channel routing

Smart-fill metric presets in the alert form include:

  • SLO achieved ratio (slo_achieved)
  • DB latency (infra_database_latency_ms)
  • SDK crash/error rates (sdk_crash_rate_pct, sdk_error_rate_pct)
  • Training completion and failed rounds
  • Rollout error rate
  • Incident MTTR and resolution rate

You can also choose Custom metric and provide metric_source + metric_name.

Supported aggregations:

  • latest, count, rate, avg
  • p95, p99, median, min, max, pct

Rules evaluate continuously in the background and can open/resolve incidents as conditions change.
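A single evaluation pass can be sketched as: aggregate the recent samples, then compare against the threshold (a minimal sketch assuming nearest-rank percentiles; the real evaluator also tracks threshold_duration_minutes before opening an incident):

```python
import statistics

OPS = {"gt": lambda v, t: v > t, "lt": lambda v, t: v < t,
       "gte": lambda v, t: v >= t, "lte": lambda v, t: v <= t}

def aggregate(samples: list[float], kind: str) -> float:
    """A few of the supported aggregations; p95/p99 use nearest-rank percentiles."""
    if kind == "avg":
        return statistics.fmean(samples)
    if kind == "latest":
        return samples[-1]
    if kind in ("p95", "p99"):
        pct = int(kind[1:])
        ordered = sorted(samples)
        rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
        return ordered[rank]
    raise ValueError(f"unsupported aggregation: {kind}")

def rule_breached(samples: list[float], rule: dict) -> bool:
    """True when the aggregated value crosses the rule's threshold."""
    value = aggregate(samples, rule["filters"]["aggregation"]["type"])
    return OPS[rule["threshold_operator"]](value, rule["threshold_value"])
```

With the example rule below, a window whose p95 error rate exceeds 5 would breach.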

Example rule shape used by the API:

{
  "name": "Rollout error rate high",
  "alert_type": "threshold",
  "severity": "error",
  "metric_name": "error_rate_pct",
  "metric_source": "rollout",
  "threshold_operator": "gt",
  "threshold_value": 5,
  "threshold_duration_minutes": 10,
  "filters": {
    "aggregation": {
      "type": "p95",
      "window_minutes": 10
    }
  },
  "notify_channels": ["integration-id-slack", "integration-id-email"]
}

Readiness by Model and Region

Side-by-side confidence signals:

  • Model readiness table
  • Geo distribution table

Data Inputs

Monitoring consolidates reliability, fleet, training, alert, and incident signals into one client-facing view so operators can make rollout and response decisions quickly.

Production policy:

  • The dashboard must show live telemetry only.
  • If a metric is unavailable, show N/A or unknown with context instead of synthetic placeholders.
  • Regional latency is displayed with source-aware labeling so customers can distinguish measured vs estimated values.

Troubleshooting

Health score looks wrong

Check these first:

  • SLO value and target
  • Incident open count and severity mix
  • SDK crash rate
  • Round completion rate

The score is intentionally conservative and may be lower than a single strong metric.

Data freshness is old

  • Use Refresh Now.
  • Confirm backend services are reachable.
  • Check Partial Failures; non-zero means some APIs failed in the latest refresh cycle.

Empty panels in narrow ranges

For short ranges (for example 24h), some panels can legitimately be empty if there was no recent activity.

Gotchas

  • Health score is intentionally conservative — a single weak signal (high incident count, low SLO) can drag the overall score below what individual metrics suggest. This is by design.
  • Time range affects all panels — changing the time range reloads every panel. For large fleets, the 90-day and 1-year ranges may take several seconds to load.
  • Alert rules evaluate continuously — rules run in the background even when the dashboard is closed. Alerts fire based on metric conditions, not page visits.
  • Regional latency may be estimated — if measured data isn't available for a region, the dashboard shows an estimate with a label. Don't treat estimated values as SLO baselines.
  • Partial failures are transient — a non-zero partial failure count usually means one API timed out during the refresh cycle. Hit Refresh Now to retry. Persistent failures indicate a backend issue.
  • Model drift scores update daily — drift detection runs on a daily cadence, not real-time. A newly deployed model won't show drift scores until the next daily evaluation.