Observability
Octomil telemetry automatically captures inference metrics and reports them to your Octomil dashboard. When enabled, every inference request generates structured events covering latency, throughput, and error rates -- without adding latency to the inference path.
Enabling Telemetry
There are two ways to enable telemetry, depending on whether you are using octomil serve or building a custom application.
Option 1: octomil serve
Pass your API key when starting the server:
octomil serve gemma-1b --api-key <your-api-key>
Telemetry starts automatically. No code changes needed.
Option 2: Programmatic (Custom Applications)
For applications that use Octomil inference directly without the serve command:
import octomil
octomil.init(api_key="<your-api-key>")
# All inference calls are now instrumented automatically.
# Telemetry is dispatched in the background.
Environment Variable Fallback
Both modes check environment variables when explicit values are not provided:
| Variable | Description | Default |
|---|---|---|
| `OCTOMIL_API_KEY` | API key for authentication | none |
| `OCTOMIL_ORG_ID` | Organization ID (auto-resolved from the API key if omitted) | none |
| `OCTOMIL_API_BASE` | Octomil server base URL | `https://api.octomil.com` |
Example:
export OCTOMIL_API_KEY=<your-api-key>
export OCTOMIL_API_BASE=https://api.octomil.com
# No --api-key flag needed
octomil serve gemma-1b
What Gets Reported
Telemetry emits three event types during each inference request.
Event Lifecycle
- `generation_started` -- emitted when the request is received and inference begins.
- `chunk_produced` -- emitted for each output token (streaming mode) or once for the full response (non-streaming).
- `generation_completed` -- emitted when inference finishes, with aggregate metrics for the request.
Metrics Captured
| Metric | Unit | Description |
|---|---|---|
| TTFC (time to first chunk) | ms | Time from request receipt to first output token. Measures model startup and prefill latency. |
| Chunk latency | ms | Time between consecutive output tokens. Indicates decode throughput stability. |
| Throughput | tok/s | Output tokens per second averaged over the full generation. |
| Total duration | ms | End-to-end request duration including prefill and decode. |
| Input tokens | count | Number of tokens in the prompt. |
| Output tokens | count | Number of tokens generated. |
| Engine | string | Which inference engine handled the request (mlx, llamacpp, echo). |
| Model | string | Model identifier used for the request. |
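To make the latency metrics concrete, here is how they relate to per-chunk arrival times. This is an illustrative calculation under the definitions in the table above, not Octomil's internal code:

```python
# Illustrative metric computation from per-chunk arrival timestamps (seconds).
# chunk_times[i] is when output token i arrived; one token per chunk is assumed.
request_start = 10.0
chunk_times = [10.15, 10.18, 10.21, 10.24, 10.27]  # 5 output tokens

# TTFC: request receipt to first output token.
ttfc_ms = (chunk_times[0] - request_start) * 1000

# Chunk latency: gaps between consecutive output tokens.
chunk_latencies_ms = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]
chunk_latency_avg_ms = sum(chunk_latencies_ms) / len(chunk_latencies_ms)

# Throughput: output tokens per second over the full generation.
total_duration_s = chunk_times[-1] - request_start
throughput_tok_s = len(chunk_times) / total_duration_s

print(f"ttfc_ms={ttfc_ms:.1f}")                      # 150.0
print(f"chunk_latency_avg_ms={chunk_latency_avg_ms:.1f}")  # 30.0
print(f"throughput_tok_s={throughput_tok_s:.1f}")
```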
Event Schema Reference
Each event is a JSON object sent to the Octomil API:
{
"event_type": "generation_completed",
"timestamp": "2026-02-19T14:32:01.442Z",
"device_id": "a1b2c3d4e5f6",
"org_id": "org_abc123",
"model": "gemma-1b",
"engine": "mlx",
"metrics": {
"ttfc_ms": 142.3,
"throughput_tok_s": 38.7,
"total_duration_ms": 1842.1,
"input_tokens": 47,
"output_tokens": 128,
"chunk_latency_avg_ms": 25.8,
"chunk_latency_p99_ms": 41.2
},
"request_id": "req_7f8a9b0c",
"stream": true
}
Device Identification
Each telemetry source is identified by a stable device ID. This ID is generated automatically from a hash of the hostname and primary MAC address, so it remains consistent across server restarts without requiring manual configuration.
The device ID appears in the dashboard under Monitoring and can be used to filter metrics by machine.
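One way such an ID could be derived is shown below. This is a sketch of the general hostname-plus-MAC hashing approach described above; the hash function and truncation length Octomil actually uses are assumptions here:

```python
import hashlib
import socket
import uuid

def device_id() -> str:
    """Illustrative stable device ID from hostname + primary MAC address.

    uuid.getnode() returns the hardware MAC as a 48-bit integer, so the
    result is stable across restarts on the same machine. The choice of
    SHA-256 and the 12-character truncation are assumptions for this sketch.
    """
    raw = f"{socket.gethostname()}-{uuid.getnode():012x}".encode()
    return hashlib.sha256(raw).hexdigest()[:12]

print(device_id())  # stable across runs on the same machine
```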
Best-Effort Dispatch
Telemetry is designed to never interfere with inference:
- Events are queued in memory and dispatched by a background thread.
- If the Octomil API is unreachable, events are dropped silently. No retries block the inference path.
- If the event queue is full, the oldest events are discarded.
- The background thread shuts down gracefully when the server stops.
This means telemetry is best-effort. In normal operation, event loss is negligible. Under extreme network disruption, some events may be lost. This is an intentional tradeoff to guarantee that telemetry never adds latency to inference responses.
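The drop-oldest, never-block behavior above can be sketched with a bounded deque guarded by a lock. This is a minimal illustration of the pattern, not the actual Octomil implementation:

```python
import threading
from collections import deque

class BestEffortQueue:
    """Minimal sketch of drop-oldest, non-blocking event dispatch."""

    def __init__(self, max_size: int = 1000):
        # A deque with maxlen silently discards the oldest item when full.
        self._events = deque(maxlen=max_size)
        self._lock = threading.Lock()

    def enqueue(self, event: dict) -> None:
        # Called on the inference path: O(1), never touches the network.
        with self._lock:
            self._events.append(event)

    def drain(self) -> list:
        # Called by the background flush thread. Network errors in the
        # caller would be swallowed so inference is never affected.
        with self._lock:
            batch = list(self._events)
            self._events.clear()
        return batch

q = BestEffortQueue(max_size=2)
q.enqueue({"id": 1})
q.enqueue({"id": 2})
q.enqueue({"id": 3})  # queue full: {"id": 1} is silently dropped
print(q.drain())      # [{'id': 2}, {'id': 3}]
```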
Dashboard
Inference telemetry is visible in the Monitoring Dashboard under the inference metrics section. The dashboard displays:
- TTFC distribution over time
- Throughput trends by model and engine
- Request volume and error rates
- Per-device performance breakdown
Use the time range picker to zoom into specific periods and the model filter to isolate individual models.
TelemetryReporter API Reference
For advanced use cases, you can interact with the TelemetryReporter class directly.
from octomil.telemetry import TelemetryReporter
reporter = TelemetryReporter(
api_key="<your-api-key>",
api_base="https://api.octomil.com",
flush_interval_seconds=5.0,
max_queue_size=1000,
)
# Report a custom event
reporter.report({
"event_type": "generation_completed",
"model": "gemma-1b",
"engine": "mlx",
"metrics": {
"ttfc_ms": 150.0,
"throughput_tok_s": 35.2,
"total_duration_ms": 2100.0,
"input_tokens": 52,
"output_tokens": 200,
},
})
# Flush pending events immediately
reporter.flush()
# Shut down the background thread
reporter.shutdown()
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str` | required | Octomil API key |
| `api_base` | `str` | `https://api.octomil.com` | Server base URL |
| `flush_interval_seconds` | `float` | `5.0` | How often the background thread flushes queued events |
| `max_queue_size` | `int` | `1000` | Maximum number of events held in memory before the oldest are dropped |
| `device_id` | `str` | auto-generated | Override for the auto-generated device ID |
Code Examples
Serve Mode with Telemetry
# Start serving with telemetry enabled
octomil serve gemma-1b --api-key <your-api-key>
In another terminal, make requests as normal. Telemetry is reported automatically in the background.
- cURL
- Python
- JavaScript
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-1b",
"messages": [{"role": "user", "content": "Hello"}]
}'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="gemma-1b",
messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8080/v1",
apiKey: "not-needed",
});
const response = await client.chat.completions.create({
model: "gemma-1b",
messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
Programmatic Mode
import octomil
# Initialize telemetry
octomil.init(api_key="<your-api-key>")
# Use the inference API as normal
response = octomil.generate(
model="gemma-1b",
messages=[{"role": "user", "content": "Hello"}],
)
# Telemetry for this request is dispatched automatically
print(response.content)
Verifying Telemetry is Active
Check the server logs for telemetry confirmation:
[telemetry] Initialized: device_id=a1b2c3d4e5f6, api_base=https://api.octomil.com
[telemetry] Flushed 3 events (queue_size=0)
If you instead see `[telemetry] Disabled: no API key provided`, set `--api-key` or the `OCTOMIL_API_KEY` environment variable.
External Forwarding
The telemetry described on this page is reported to the Octomil dashboard. To forward raw per-event data to your own Grafana, Datadog, or any other OTLP-compatible collector, see Export Metrics — OTLP Collector.
Related Docs
- Octomil Serve -- local inference server setup
- Move-to-Device Recommendations -- automated recommendations based on telemetry data
- Monitoring Dashboard -- view telemetry in the dashboard
- Workspace Settings -- configure integrations and alert routing