
Node SDK

Server-side SDK for managing Octomil resources from Node.js and TypeScript backends.

Installation

pnpm add @octomil/sdk

Quick Start

import { OctomilClient } from "@octomil/sdk";

const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });

// List models
const models = await client.models.list();

// Deploy a model
await client.deployments.create({
  modelId: "sentiment-v1",
  version: "2.0.0",
  rollout: 10,
  strategy: "canary",
});

Integrations Management

Manage metrics and log export integrations programmatically.

import { OctomilClient } from "@octomil/sdk";

const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });

// Connect OTLP collector (metrics + logs in one call)
const { metrics, logs } = await client.integrations.connectOtlpCollector({
  name: "Production Grafana",
  endpoint: "http://otel-collector:4318",
  headers: { Authorization: "Basic abc123" },
});

// List integrations
const metricsIntegrations = await client.integrations.listMetricsIntegrations();
const logIntegrations = await client.integrations.listLogIntegrations();

// Create individual integrations
await client.integrations.createMetricsIntegration({
  name: "Prod Prometheus",
  integration_type: "prometheus",
  config: { prefix: "octomil", scrape_interval: 30 },
});

// Test and delete (integrationId comes from a create or list response)
await client.integrations.testMetricsIntegration(integrationId);
await client.integrations.deleteMetricsIntegration(integrationId);

Inference

Pull a model from the registry, cache it locally, and run inference:

import { OctomilClient } from "@octomil/sdk";

const client = new OctomilClient({
  apiKey: "edg_...",
  orgId: "your-org-id",
  // serverUrl: "https://api.octomil.com", // optional, default
  // cacheDir: "~/.octomil/models", // optional, default
});

// Pull, cache, and predict in one call
const output = await client.predict("sentiment-v1", { text: "Octomil is great" });
console.log(output.label); // e.g. "positive"
console.log(output.score); // e.g. 0.97
console.log(output.latencyMs); // inference time in ms

Pull and load separately

// Pull downloads and caches the model
const model = await client.pull("sentiment-v1", {
  version: "2.0.0",
  format: "onnx",
  force: false, // skip re-download if cached
  onProgress: (downloaded, total) => {
    console.log(`${Math.round((downloaded / total) * 100)}%`);
  },
});

// Load with engine options
await model.load({
  executionProvider: "cpu", // "cpu" | "cuda" | "tensorrt" | "coreml"
  graphOptimizationLevel: "all",
  intraOpNumThreads: 4,
});

// Run inference
const result = await model.predict({ text: "Hello world" });

Cache management

// List cached models
const cached = await client.listCached();
// [{ modelRef, filePath, cachedAt, sizeBytes }]

// Remove a cached model
await client.removeCache("sentiment-v1");

Streaming Inference

Stream tokens from the Octomil cloud inference endpoint via Server-Sent Events. This calls POST /api/v1/inference/stream and yields StreamToken objects as they arrive.

import { OctomilClient } from "@octomil/sdk";

const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });

// Stream with chat messages
for await (const token of client.streamPredict(
  "phi-4-mini",
  [{ role: "user", content: "Write a haiku about edge AI." }],
  { temperature: 0.8, max_tokens: 64 },
)) {
  process.stdout.write(token.token);
  if (token.done) break;
}

String prompt

for await (const token of client.streamPredict("phi-4-mini", "Explain federated learning.")) {
  process.stdout.write(token.token);
}

Standalone streamInference function

You can also use the streaming function directly without OctomilClient:

import { streamInference } from "@octomil/sdk";

const config = {
  serverUrl: "https://api.octomil.com",
  apiKey: "edg_...",
};

for await (const token of streamInference(config, "phi-4-mini", "Hello")) {
process.stdout.write(token.token);
}
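
Under the hood, each Server-Sent Event carries a data: line whose payload is a JSON-encoded token. A minimal parser sketch, assuming the stream terminates with a [DONE] sentinel as many SSE inference APIs do (this is not the SDK's actual implementation, which also handles chunk boundaries and reconnection):

```typescript
// Parse the "data:" lines of one SSE chunk into plain objects.
// Sketch only: real clients must buffer partial lines across chunks.
function parseSseData(chunk: string): Array<Record<string, unknown>> {
  const events: Array<Record<string, unknown>> = [];
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip comments and blank separators
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") break; // assumed end-of-stream sentinel
    events.push(JSON.parse(payload));
  }
  return events;
}
```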

Each StreamToken contains:

Field       Type       Description
token       string     The generated text fragment
done        boolean    true on the final token
provider    string?    Which backend served the request
latencyMs   number?    Server-side latency for this token
sessionId   string?    Unique session identifier
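
Expressed as a TypeScript interface, the fields above look like this (a sketch for illustration; the SDK's exported StreamToken type may differ in detail):

```typescript
// Illustrative shape of a streamed token; optional fields may be absent.
interface StreamToken {
  token: string;      // the generated text fragment
  done: boolean;      // true on the final token
  provider?: string;  // which backend served the request
  latencyMs?: number; // server-side latency for this token
  sessionId?: string; // unique session identifier
}
```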

Routing

The RoutingClient asks the Octomil API whether to run inference on-device or in the cloud, based on model size, device capabilities, and routing preference.

import { RoutingClient, detectDeviceCapabilities } from "@octomil/sdk";

const routing = new RoutingClient({
  serverUrl: "https://api.octomil.com",
  apiKey: "edg_...",
  cacheTtlMs: 300_000, // cache routing decisions for 5 minutes (default)
  prefer: "fastest", // "device" | "cloud" | "cheapest" | "fastest"
});

const capabilities = await detectDeviceCapabilities();
// { platform: "node", model: "Darwin arm64 ...", total_memory_mb: 16384,
// gpu_available: false, npu_available: false, supported_runtimes: ["onnxruntime-node"] }

const decision = await routing.route(
  "phi-4-mini",    // modelId
  3_800_000_000,   // modelParams
  2048,            // modelSizeMb
  capabilities,
);

if (decision) {
  console.log(decision.target); // "device" or "cloud"
  console.log(decision.format); // e.g. "onnx"
  console.log(decision.engine); // e.g. "onnxruntime-node"
  console.log(decision.fallback_target); // cloud fallback endpoint, or null
}

Cloud inference via routing

When routing decides to use the cloud, call cloudInfer():

if (decision?.target === "cloud") {
  const result = await routing.cloudInfer(
    "phi-4-mini",
    { prompt: "Hello, world!" },
    { temperature: 0.7, max_tokens: 128 },
  );
  console.log(result.output); // model response
  console.log(result.latency_ms); // server-side latency
  console.log(result.provider); // which cloud provider served it
}

Cache management

// Invalidate a specific model's cached routing decision
routing.invalidate("phi-4-mini");

// Clear all cached routing decisions
routing.clearCache();

Batch Inference

Run multiple predictions sequentially by iterating over inputs. The predict() method handles model caching internally, so the model is only downloaded and loaded once.

const inputs = [
  { text: "This product is amazing" },
  { text: "Terrible experience" },
  { text: "It was okay, nothing special" },
];

const results = [];
for (const input of inputs) {
  const output = await client.predict("sentiment-v1", input);
  results.push({ input: input.text, label: output.label, score: output.score });
}

console.table(results);

For higher throughput, pull and load the model once, then run the predictions concurrently with Promise.all:

const model = await client.pull("sentiment-v1");
await model.load({ executionProvider: "cpu" });

const results = await Promise.all(
  inputs.map((input) => model.predict(input)),
);
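
For very large batches you may also want to bound how many predictions are in flight at once. A small generic helper like the following can cap concurrency (this is not part of the SDK, just a common pattern):

```typescript
// Run `fn` over `items` with at most `limit` promises in flight,
// preserving input order in the results array.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor over the input array
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  // Spawn up to `limit` workers that drain the shared cursor.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Usage with a loaded model would look like `await mapWithConcurrency(inputs, 4, (input) => model.predict(input))`.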

Requirements

  • Node.js 18+
  • TypeScript 5.0+ (optional but recommended)