Observability
Octomil telemetry automatically captures inference metrics and reports them to your Octomil dashboard. When enabled, every inference request generates structured events covering latency, throughput, and error rates -- without adding latency to the inference path.
Enabling Telemetry
There are two ways to enable telemetry, depending on whether you are using octomil serve or building a custom application.
Option 1: octomil serve
Pass your API key when starting the server:
octomil serve gemma-1b --api-key <your-api-key>
Telemetry starts automatically. No code changes needed.
Option 2: Programmatic (Custom Applications)
For applications that use Octomil inference directly without the serve command:
import octomil
octomil.init(api_key="<your-api-key>")
# All inference calls are now instrumented automatically.
# Telemetry is dispatched in the background.
Environment Variable Fallback
Both modes check environment variables when explicit values are not provided:
| Variable | Description | Default |
|---|---|---|
| `OCTOMIL_API_KEY` | API key for authentication | none |
| `OCTOMIL_ORG_ID` | Organization ID (auto-resolved from the API key if omitted) | none |
| `OCTOMIL_API_BASE` | Octomil server base URL | `https://api.octomil.com` |
Example:
export OCTOMIL_API_KEY=<your-api-key>
export OCTOMIL_API_BASE=https://api.octomil.com
# No --api-key flag needed
octomil serve gemma-1b
What Gets Reported
Telemetry emits three event types during each inference request.
Event Lifecycle
- `generation_started` -- emitted when the request is received and inference begins.
- `chunk_produced` -- emitted for each output token (streaming mode) or once for the full response (non-streaming).
- `generation_completed` -- emitted when inference finishes, with aggregate metrics for the request.
Metrics Captured
| Metric | Unit | Description |
|---|---|---|
| TTFC (time to first chunk) | ms | Time from request receipt to first output token. Measures model startup and prefill latency. |
| Chunk latency | ms | Time between consecutive output tokens. Indicates decode throughput stability. |
| Throughput | tok/s | Output tokens per second averaged over the full generation. |
| Total duration | ms | End-to-end request duration including prefill and decode. |
| Input tokens | count | Number of tokens in the prompt. |
| Output tokens | count | Number of tokens generated. |
| Engine | string | Which inference engine handled the request (mlx, llamacpp, echo). |
| Model | string | Model identifier used for the request. |
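To make the latency metrics concrete, here is how they relate to per-chunk arrival times. This is an illustrative calculation under the definitions in the table above, not Octomil's internal code:

```python
# Illustrative metric computation from per-chunk arrival timestamps (seconds).
# chunk_times[i] is when output token i arrived; one token per chunk is assumed.
request_start = 10.0
chunk_times = [10.15, 10.18, 10.21, 10.24, 10.27]  # 5 output tokens

# TTFC: request receipt to first output token.
ttfc_ms = (chunk_times[0] - request_start) * 1000

# Chunk latency: gaps between consecutive output tokens.
chunk_latencies_ms = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]
chunk_latency_avg_ms = sum(chunk_latencies_ms) / len(chunk_latencies_ms)

# Throughput: output tokens per second over the full generation.
total_duration_s = chunk_times[-1] - request_start
throughput_tok_s = len(chunk_times) / total_duration_s

print(f"ttfc_ms={ttfc_ms:.1f}")                      # 150.0
print(f"chunk_latency_avg_ms={chunk_latency_avg_ms:.1f}")  # 30.0
print(f"throughput_tok_s={throughput_tok_s:.1f}")
```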
Event Schema Reference
Each event is a JSON object sent to the Octomil API:
{
"event_type": "generation_completed",
"timestamp": "2026-02-19T14:32:01.442Z",
"device_id": "a1b2c3d4e5f6",
"org_id": "org_abc123",
"model": "gemma-1b",
"engine": "mlx",
"metrics": {
"ttfc_ms": 142.3,
"throughput_tok_s": 38.7,
"total_duration_ms": 1842.1,
"input_tokens": 47,
"output_tokens": 128,
"chunk_latency_avg_ms": 25.8,
"chunk_latency_p99_ms": 41.2
},
"request_id": "req_7f8a9b0c",
"stream": true
}
Device Identification
Each telemetry source is identified by a stable device ID. This ID is generated automatically from a hash of the hostname and primary MAC address, so it remains consistent across server restarts without requiring manual configuration.
The device ID appears in the dashboard under Monitoring and can be used to filter metrics by machine.
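One way such an ID could be derived is shown below. This is a sketch of the general hostname-plus-MAC hashing approach described above; the hash function and truncation length Octomil actually uses are assumptions here:

```python
import hashlib
import socket
import uuid

def device_id() -> str:
    """Illustrative stable device ID from hostname + primary MAC address.

    uuid.getnode() returns the hardware MAC as a 48-bit integer, so the
    result is stable across restarts on the same machine. The choice of
    SHA-256 and the 12-character truncation are assumptions for this sketch.
    """
    raw = f"{socket.gethostname()}-{uuid.getnode():012x}".encode()
    return hashlib.sha256(raw).hexdigest()[:12]

print(device_id())  # stable across runs on the same machine
```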
Best-Effort Dispatch
Telemetry is designed to never interfere with inference:
- Events are queued in memory and dispatched by a background thread.
- If the Octomil API is unreachable, events are dropped silently. No retries block the inference path.
- If the event queue is full, the oldest events are discarded.
- The background thread shuts down gracefully when the server stops.
This means telemetry is best-effort. In normal operation, event loss is negligible. Under extreme network disruption, some events may be lost. This is an intentional tradeoff to guarantee that telemetry never adds latency to inference responses.
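The drop-oldest, never-block behavior above can be sketched with a bounded deque guarded by a lock. This is a minimal illustration of the pattern, not the actual Octomil implementation:

```python
import threading
from collections import deque

class BestEffortQueue:
    """Minimal sketch of drop-oldest, non-blocking event dispatch."""

    def __init__(self, max_size: int = 1000):
        # A deque with maxlen silently discards the oldest item when full.
        self._events = deque(maxlen=max_size)
        self._lock = threading.Lock()

    def enqueue(self, event: dict) -> None:
        # Called on the inference path: O(1), never touches the network.
        with self._lock:
            self._events.append(event)

    def drain(self) -> list:
        # Called by the background flush thread. Network errors in the
        # caller would be swallowed so inference is never affected.
        with self._lock:
            batch = list(self._events)
            self._events.clear()
        return batch

q = BestEffortQueue(max_size=2)
q.enqueue({"id": 1})
q.enqueue({"id": 2})
q.enqueue({"id": 3})  # queue full: {"id": 1} is silently dropped
print(q.drain())      # [{'id': 2}, {'id': 3}]
```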
Dashboard
Inference telemetry is visible in the Monitoring Dashboard under the inference metrics section. The dashboard displays:
- TTFC distribution over time
- Throughput trends by model and engine
- Request volume and error rates
- Per-device performance breakdown
Use the time range picker to zoom into specific periods and the model filter to isolate individual models.
TelemetryReporter API Reference
For advanced use cases, you can interact with the TelemetryReporter class directly.
from octomil.telemetry import TelemetryReporter
reporter = TelemetryReporter(
api_key="<your-api-key>",
api_base="https://api.octomil.com",
flush_interval_seconds=5.0,
max_queue_size=1000,
)
# Report a custom event
reporter.report({
"event_type": "generation_completed",
"model": "gemma-1b",
"engine": "mlx",
"metrics": {
"ttfc_ms": 150.0,
"throughput_tok_s": 35.2,
"total_duration_ms": 2100.0,
"input_tokens": 52,
"output_tokens": 200,
},
})
# Flush pending events immediately
reporter.flush()
# Shut down the background thread
reporter.shutdown()
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str` | required | Octomil API key |
| `api_base` | `str` | `https://api.octomil.com` | Server base URL |
| `flush_interval_seconds` | `float` | `5.0` | How often the background thread flushes queued events |
| `max_queue_size` | `int` | `1000` | Maximum number of events held in memory before the oldest are dropped |
| `device_id` | `str` | auto-generated | Override for the auto-generated device ID |
Code Examples
Serve Mode with Telemetry
# Start serving with telemetry enabled
octomil serve gemma-1b --api-key <your-api-key>
In another terminal, make requests as normal. Telemetry is reported automatically in the background.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "Hello"}]
}'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-1b",
messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8080/v1",
apiKey: "not-needed",
});
const response = await client.chat.completions.create({
model: "gemma-1b",
messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
Programmatic Mode
import octomil
# Initialize telemetry
octomil.init(api_key="<your-api-key>")
# Use the inference API as normal
response = octomil.generate(
model="gemma-1b",
messages=[{"role": "user", "content": "Hello"}],
)
# Telemetry for this request is dispatched automatically
print(response.content)
Verifying Telemetry is Active
Check the server logs for telemetry confirmation:
[telemetry] Initialized: device_id=a1b2c3d4e5f6, api_base=https://api.octomil.com
[telemetry] Flushed 3 events (queue_size=0)
If you instead see `[telemetry] Disabled: no API key provided`, set `--api-key` or the `OCTOMIL_API_KEY` environment variable.
External Forwarding
The telemetry described on this page is reported to the Octomil dashboard. To forward raw per-event data to your own Grafana, Datadog, or any other OTLP-compatible collector, see Export Metrics — OTLP Collector.
Related Docs
- Octomil Serve -- local inference server setup
- Move-to-Device Recommendations -- automated recommendations based on telemetry data
- Monitoring Dashboard -- view telemetry in the dashboard
- Workspace Settings -- configure integrations and alert routing