Node SDK
Server-side SDK for managing Octomil resources from Node.js and TypeScript backends.
Installation
pnpm add @octomil/sdk
Quick Start
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });
// List models
const models = await client.models.list();
// Deploy a model
await client.deployments.create({
modelId: "sentiment-v1",
version: "2.0.0",
rollout: 10,
strategy: "canary",
});
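Network calls like deployments.create can fail transiently. A generic retry-with-backoff helper can wrap any client call; this is an illustrative sketch, not part of @octomil/sdk:

```typescript
// Generic retry-with-backoff helper (illustrative, not part of @octomil/sdk).
// Wraps any async SDK call and retries failures with exponential backoff.
export async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Exponential backoff: 500 ms, 1 s, 2 s, ...
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Usage: `await withRetry(() => client.deployments.create({ ... }));`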
Integrations Management
Manage metrics and log export integrations programmatically.
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });
// Connect OTLP collector (metrics + logs in one call)
const { metrics, logs } = await client.integrations.connectOtlpCollector({
name: "Production Grafana",
endpoint: "http://otel-collector:4318",
headers: { Authorization: "Basic abc123" },
});
// List integrations
const metricsIntegrations = await client.integrations.listMetricsIntegrations();
const logIntegrations = await client.integrations.listLogIntegrations();
// Create individual integrations
await client.integrations.createMetricsIntegration({
name: "Prod Prometheus",
integration_type: "prometheus",
config: { prefix: "octomil", scrape_interval: 30 },
});
// Test and delete an existing integration by its id
// (e.g. taken from the entries returned by listMetricsIntegrations())
await client.integrations.testMetricsIntegration(integrationId);
await client.integrations.deleteMetricsIntegration(integrationId);
Inference
Pull a model from the registry, cache it locally, and run inference:
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({
apiKey: "edg_...",
orgId: "your-org-id",
// serverUrl: "https://api.octomil.com", // optional, default
// cacheDir: "~/.octomil/models", // optional, default
});
// Pull, cache, and predict in one call
const output = await client.predict("sentiment-v1", { text: "Octomil is great" });
console.log(output.label); // e.g. "positive"
console.log(output.score); // e.g. 0.97
console.log(output.latencyMs); // inference time in ms
Pull and load separately
// Pull downloads and caches the model
const model = await client.pull("sentiment-v1", {
version: "2.0.0",
format: "onnx",
force: false, // skip re-download if cached
onProgress: (downloaded, total) => {
console.log(`${Math.round(downloaded / total * 100)}%`);
},
});
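The onProgress callback above prints a raw percentage. A small formatting helper (illustrative only, not part of the SDK) keeps that logic reusable and guards against an unknown total:

```typescript
// Format download progress as a percentage string (illustrative helper,
// not part of @octomil/sdk). Falls back to a byte count if the total
// size is unknown or zero.
export function formatProgress(downloaded: number, total: number): string {
  if (!total || total <= 0) return `${downloaded} bytes`;
  const pct = Math.min(100, Math.round((downloaded / total) * 100));
  return `${pct}%`;
}
```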
// Load with engine options
await model.load({
executionProvider: "cpu", // "cpu" | "cuda" | "tensorrt" | "coreml"
graphOptimizationLevel: "all",
intraOpNumThreads: 4,
});
// Run inference
const result = await model.predict({ text: "Hello world" });
Cache management
// List cached models
const cached = await client.listCached();
// [{ modelRef, filePath, cachedAt, sizeBytes }]
// Remove a cached model
await client.removeCache("sentiment-v1");
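Building on listCached(), a cache-usage summary can be computed from the returned entries. The helper below is an illustrative sketch that assumes only the sizeBytes and modelRef fields shown above:

```typescript
// Summarize total cache usage from listCached() entries (illustrative
// sketch; assumes only the modelRef and sizeBytes fields shown above).
interface CachedEntry {
  modelRef: string;
  sizeBytes: number;
}

export function cacheUsage(entries: CachedEntry[]): string {
  const totalBytes = entries.reduce((sum, e) => sum + e.sizeBytes, 0);
  const mb = totalBytes / (1024 * 1024);
  return `${entries.length} model(s), ${mb.toFixed(1)} MB`;
}
```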
Streaming Inference
Stream tokens from the Octomil cloud inference endpoint via Server-Sent Events. This calls POST /api/v1/inference/stream and yields StreamToken objects as they arrive.
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });
// Stream with chat messages
for await (const token of client.streamPredict(
"phi-4-mini",
[{ role: "user", content: "Write a haiku about edge AI." }],
{ temperature: 0.8, max_tokens: 64 },
)) {
process.stdout.write(token.token);
if (token.done) break;
}
String prompt
for await (const token of client.streamPredict("phi-4-mini", "Explain federated learning.")) {
process.stdout.write(token.token);
}
Standalone streamInference function
You can also use the streaming function directly without OctomilClient:
import { streamInference } from "@octomil/sdk";
const config = {
serverUrl: "https://api.octomil.com",
apiKey: "edg_...",
};
for await (const token of streamInference(config, "phi-4-mini", "Hello")) {
process.stdout.write(token.token);
}
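Both streaming APIs yield an async iterable of StreamToken, so a small collector can turn a stream into the full generated string. This sketch uses only the token and done fields; everything else about the type is as documented in the table that follows:

```typescript
// Collect an async iterable of StreamToken into the full generated text
// (illustrative helper; uses only the token and done fields).
interface StreamToken {
  token: string;
  done: boolean;
}

export async function collectStream(
  stream: AsyncIterable<StreamToken>,
): Promise<string> {
  let text = "";
  for await (const t of stream) {
    text += t.token;
    if (t.done) break; // stop on the final token
  }
  return text;
}
```

Usage: `const text = await collectStream(client.streamPredict("phi-4-mini", "Hello"));`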
Each StreamToken contains:
| Field | Type | Description |
|---|---|---|
| token | string | The generated text fragment |
| done | boolean | true on the final token |
| provider | string? | Which backend served the request |
| latencyMs | number? | Server-side latency for this token |
| sessionId | string? | Unique session identifier |
Routing
The RoutingClient asks the Octomil API whether to run inference on-device or in the cloud, based on model size, device capabilities, and routing preference.
import { RoutingClient, detectDeviceCapabilities } from "@octomil/sdk";
const routing = new RoutingClient({
serverUrl: "https://api.octomil.com",
apiKey: "edg_...",
cacheTtlMs: 300_000, // cache routing decisions for 5 minutes (default)
prefer: "fastest", // "device" | "cloud" | "cheapest" | "fastest"
});
const capabilities = await detectDeviceCapabilities();
// { platform: "node", model: "Darwin arm64 ...", total_memory_mb: 16384,
// gpu_available: false, npu_available: false, supported_runtimes: ["onnxruntime-node"] }
const decision = await routing.route(
"phi-4-mini", // modelId
3_800_000_000, // modelParams
2048, // modelSizeMb
capabilities,
);
if (decision) {
console.log(decision.target); // "device" or "cloud"
console.log(decision.format); // e.g. "onnx"
console.log(decision.engine); // e.g. "onnxruntime-node"
console.log(decision.fallback_target); // cloud fallback endpoint, or null
}
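The actual decision is made server-side by the Octomil API, but the inputs above suggest the shape of the trade-off. A purely local sketch of such a heuristic (illustrative only; not the real routing logic, and the 2x memory headroom rule is an assumption) might look like:

```typescript
// Local sketch of a device-vs-cloud heuristic (illustrative only; the
// real decision is made by the Octomil API from the same inputs).
interface Capabilities {
  total_memory_mb: number;
  supported_runtimes: string[];
}

export function sketchRoute(
  modelSizeMb: number,
  caps: Capabilities,
  requiredRuntime: string,
): "device" | "cloud" {
  const runtimeOk = caps.supported_runtimes.includes(requiredRuntime);
  // Assumed rule of thumb: require ~2x the model size in RAM as headroom.
  const fitsInMemory = modelSizeMb * 2 <= caps.total_memory_mb;
  return runtimeOk && fitsInMemory ? "device" : "cloud";
}
```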
Cloud inference via routing
When routing decides to use the cloud, call cloudInfer():
if (decision?.target === "cloud") {
const result = await routing.cloudInfer(
"phi-4-mini",
{ prompt: "Hello, world!" },
{ temperature: 0.7, max_tokens: 128 },
);
console.log(result.output); // model response
console.log(result.latency_ms); // server-side latency
console.log(result.provider); // which cloud provider served it
}
Cache management
// Invalidate a specific model's cached routing decision
routing.invalidate("phi-4-mini");
// Clear all cached routing decisions
routing.clearCache();
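The cacheTtlMs option behaves like a standard TTL cache. A minimal sketch of the same behavior (illustrative, not the SDK's internal implementation; the injectable clock exists only to make it testable):

```typescript
// Minimal TTL cache mirroring the cacheTtlMs behavior described above
// (illustrative; not the SDK's internal implementation).
export class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private ttlMs: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() > entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  invalidate(key: string): void {
    this.entries.delete(key);
  }

  clear(): void {
    this.entries.clear();
  }
}
```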
Batch Inference
Run multiple predictions sequentially by iterating over inputs. The predict() method handles model caching internally, so the model is only downloaded and loaded once.
const inputs = [
{ text: "This product is amazing" },
{ text: "Terrible experience" },
{ text: "It was okay, nothing special" },
];
const results = [];
for (const input of inputs) {
const output = await client.predict("sentiment-v1", input);
results.push({ input: input.text, label: output.label, score: output.score });
}
console.table(results);
For higher throughput, pull and load the model once, then run predictions concurrently with Promise.all:
const model = await client.pull("sentiment-v1");
await model.load({ executionProvider: "cpu" });
const results = await Promise.all(
inputs.map(input => model.predict(input)),
);
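An unbounded Promise.all can queue every prediction at once against a single loaded model. A concurrency-limited mapper (an illustrative helper, not part of the SDK) caps in-flight work while preserving input order:

```typescript
// Map over inputs with at most `limit` calls in flight at once
// (illustrative helper; results keep input order).
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    // Each worker pulls the next unclaimed index until none remain.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

Usage: `const results = await mapWithConcurrency(inputs, 4, input => model.predict(input));`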
Requirements
- Node.js 18+
- TypeScript 5.0+ (optional but recommended)
Related
- Python SDK — server-side Python client
- Browser SDK — client-side browser inference
- Export Metrics — configure observability export
- CLI Reference — command-line interface