Observability

Octomil telemetry automatically captures inference metrics and reports them to your Octomil dashboard. When enabled, every inference request generates structured events covering latency, throughput, and error rates -- without adding latency to the inference path.

Enabling Telemetry

There are two ways to enable telemetry, depending on whether you are using octomil serve or building a custom application.

Option 1: octomil serve

Pass your API key when starting the server:

octomil serve gemma-1b --api-key <your-api-key>

Telemetry starts automatically. No code changes needed.

Option 2: Programmatic (Custom Applications)

For applications that use Octomil inference directly without the serve command:

import octomil

octomil.init(api_key="<your-api-key>")

# All inference calls are now instrumented automatically.
# Telemetry is dispatched in the background.

Environment Variable Fallback

Both modes check environment variables when explicit values are not provided:

| Variable | Description | Default |
|---|---|---|
| OCTOMIL_API_KEY | API key for authentication | none |
| OCTOMIL_ORG_ID | Organization ID (auto-resolved from API key if omitted) | none |
| OCTOMIL_API_BASE | Octomil server base URL | https://api.octomil.com |

Example:

export OCTOMIL_API_KEY=<your-api-key>
export OCTOMIL_API_BASE=https://api.octomil.com

# No --api-key flag needed
octomil serve gemma-1b
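The fallback order described above (explicit value first, then environment variable, then default) can be sketched as follows. This is illustrative only: resolve_telemetry_config is a hypothetical helper, not part of the octomil package.

```python
import os

def resolve_telemetry_config(api_key=None, api_base=None):
    """Sketch of the fallback order: explicit argument, then
    environment variable, then default (hypothetical helper)."""
    return {
        "api_key": api_key or os.environ.get("OCTOMIL_API_KEY"),
        "api_base": api_base
        or os.environ.get("OCTOMIL_API_BASE", "https://api.octomil.com"),
    }
```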

What Gets Reported

Telemetry emits three event types during each inference request.

Event Lifecycle

  1. generation_started -- emitted when the request is received and inference begins.
  2. chunk_produced -- emitted for each output token (streaming mode) or once for the full response (non-streaming).
  3. generation_completed -- emitted when inference finishes, with aggregate metrics for the request.
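The lifecycle above can be sketched as a wrapper around a streaming generator. This is illustrative only: the real instrumentation hook is internal to Octomil, and the report callable here is a stand-in for its dispatcher.

```python
import time

def instrumented_stream(chunks, report):
    """Emit the three-event lifecycle around a streaming generation.

    `report` is a callable standing in for the telemetry dispatcher
    (hypothetical; the real hook is internal to Octomil).
    """
    start = time.monotonic()
    report({"event_type": "generation_started"})
    n = 0
    for chunk in chunks:
        n += 1
        # One chunk_produced event per output token in streaming mode.
        report({"event_type": "chunk_produced", "index": n})
        yield chunk
    report({
        "event_type": "generation_completed",
        "metrics": {
            "output_tokens": n,
            "total_duration_ms": (time.monotonic() - start) * 1000,
        },
    })
```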

Metrics Captured

| Metric | Unit | Description |
|---|---|---|
| TTFC (time to first chunk) | ms | Time from request receipt to first output token. Measures model startup and prefill latency. |
| Chunk latency | ms | Time between consecutive output tokens. Indicates decode throughput stability. |
| Throughput | tok/s | Output tokens per second averaged over the full generation. |
| Total duration | ms | End-to-end request duration including prefill and decode. |
| Input tokens | count | Number of tokens in the prompt. |
| Output tokens | count | Number of tokens generated. |
| Engine | string | Which inference engine handled the request (mlx, llamacpp, echo). |
| Model | string | Model identifier used for the request. |

Event Schema Reference

Each event is a JSON object sent to the Octomil API:

{
  "event_type": "generation_completed",
  "timestamp": "2026-02-19T14:32:01.442Z",
  "device_id": "a1b2c3d4e5f6",
  "org_id": "org_abc123",
  "model": "gemma-1b",
  "engine": "mlx",
  "metrics": {
    "ttfc_ms": 142.3,
    "throughput_tok_s": 38.7,
    "total_duration_ms": 1842.1,
    "input_tokens": 47,
    "output_tokens": 128,
    "chunk_latency_avg_ms": 25.8,
    "chunk_latency_p99_ms": 41.2
  },
  "request_id": "req_7f8a9b0c",
  "stream": true
}
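If you construct events yourself, a client-side sanity check against the top-level fields shown above might look like this. This is a sketch only: the server-side schema may enforce more than field presence.

```python
import json

# Top-level fields from the example event above.
REQUIRED_TOP_LEVEL = {
    "event_type", "timestamp", "device_id", "org_id",
    "model", "engine", "metrics", "request_id", "stream",
}

def is_valid_event(payload: str) -> bool:
    """Check a serialized event for the top-level fields shown above
    (client-side sanity check, not Octomil's server-side schema)."""
    try:
        event = json.loads(payload)
    except json.JSONDecodeError:
        return False
    return isinstance(event, dict) and REQUIRED_TOP_LEVEL.issubset(event)
```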

Device Identification

Each telemetry source is identified by a stable device ID. This ID is generated automatically from a hash of the hostname and primary MAC address, so it remains consistent across server restarts without requiring manual configuration.
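A derivation along these lines reproduces the behavior described. Note the assumption: the exact hash function, field order, and truncation length Octomil uses are not documented here.

```python
import hashlib
import socket
import uuid

def stable_device_id() -> str:
    """Derive a short device ID from hostname + primary MAC.

    Sketch only: Octomil's actual derivation may differ (hash choice,
    field order, and 12-char truncation are assumptions here).
    """
    hostname = socket.gethostname()
    mac = uuid.getnode()  # primary MAC address as a 48-bit integer
    raw = f"{hostname}-{mac:012x}".encode()
    return hashlib.sha256(raw).hexdigest()[:12]
```

Because both inputs are stable across restarts, the ID is too, which is what makes per-machine filtering in the dashboard reliable.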

The device ID appears in the dashboard under Monitoring and can be used to filter metrics by machine.

Best-Effort Dispatch

Telemetry is designed to never interfere with inference:

  • Events are queued in memory and dispatched by a background thread.
  • If the Octomil API is unreachable, events are dropped silently. No retries block the inference path.
  • If the event queue is full, the oldest events are discarded.
  • The background thread shuts down gracefully when the server stops.

This means telemetry is best-effort. In normal operation, event loss is negligible. Under extreme network disruption, some events may be lost. This is an intentional tradeoff to guarantee that telemetry never adds latency to inference responses.
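The drop-oldest behavior of the bounded queue can be sketched with a deque, whose maxlen discards the oldest item on overflow. BoundedEventQueue is a hypothetical class for illustration, not Octomil's internal implementation.

```python
from collections import deque
from threading import Lock

class BoundedEventQueue:
    """Drop-oldest event buffer, as described above (illustrative)."""

    def __init__(self, max_size: int = 1000):
        # deque with maxlen silently evicts the oldest entry when full.
        self._events = deque(maxlen=max_size)
        self._lock = Lock()

    def put(self, event: dict) -> None:
        """Enqueue without ever blocking the caller (the inference path)."""
        with self._lock:
            self._events.append(event)

    def drain(self) -> list:
        """Take all pending events; called by the background flush thread."""
        with self._lock:
            batch = list(self._events)
            self._events.clear()
        return batch
```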

Dashboard

Inference telemetry is visible in the Monitoring Dashboard under the inference metrics section. The dashboard displays:

  • TTFC distribution over time
  • Throughput trends by model and engine
  • Request volume and error rates
  • Per-device performance breakdown

Use the time range picker to zoom into specific periods and the model filter to isolate individual models.

TelemetryReporter API Reference

For advanced use cases, you can interact with the TelemetryReporter class directly.

from octomil.telemetry import TelemetryReporter

reporter = TelemetryReporter(
    api_key="<your-api-key>",
    api_base="https://api.octomil.com",
    flush_interval_seconds=5.0,
    max_queue_size=1000,
)

# Report a custom event
reporter.report({
    "event_type": "generation_completed",
    "model": "gemma-1b",
    "engine": "mlx",
    "metrics": {
        "ttfc_ms": 150.0,
        "throughput_tok_s": 35.2,
        "total_duration_ms": 2100.0,
        "input_tokens": 52,
        "output_tokens": 200,
    },
})

# Flush pending events immediately
reporter.flush()

# Shut down the background thread
reporter.shutdown()

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | required | Octomil API key |
| api_base | str | https://api.octomil.com | Server base URL |
| flush_interval_seconds | float | 5.0 | How often the background thread flushes queued events |
| max_queue_size | int | 1000 | Maximum events held in memory before oldest are dropped |
| device_id | str | auto-generated | Override the auto-generated device ID |

Code Examples

Serve Mode with Telemetry

# Start serving with telemetry enabled
octomil serve gemma-1b --api-key <your-api-key>

In another terminal, make requests as normal. Telemetry is reported automatically in the background.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-1b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Programmatic Mode

import octomil

# Initialize telemetry
octomil.init(api_key="<your-api-key>")

# Use the inference API as normal
response = octomil.generate(
    model="gemma-1b",
    messages=[{"role": "user", "content": "Hello"}],
)

# Telemetry for this request is dispatched automatically
print(response.content)

Verifying Telemetry is Active

Check the server logs for telemetry confirmation:

[telemetry] Initialized: device_id=a1b2c3d4e5f6, api_base=https://api.octomil.com
[telemetry] Flushed 3 events (queue_size=0)

If you instead see [telemetry] Disabled: no API key provided, pass --api-key or set the OCTOMIL_API_KEY environment variable.

External Forwarding

This page covers telemetry reported to the Octomil dashboard. To forward raw per-event data to your own Grafana, Datadog, or any OTLP-compatible collector, see Export Metrics — OTLP Collector.