Node SDK
Server-side SDK for managing Octomil resources from Node.js and TypeScript backends.
Installation
pnpm add @octomil/sdk
Quick Start
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });
// List models
const models = await client.models.list();
// Deploy a model
await client.deployments.create({
modelId: "sentiment-v1",
version: "2.0.0",
rollout: 10,
strategy: "canary",
});
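Network calls like deployments.create can fail transiently. A generic retry-with-backoff helper can wrap any client call; this is an illustrative sketch, not part of @octomil/sdk:

```typescript
// Generic retry-with-backoff helper (illustrative, not part of @octomil/sdk).
// Wraps any async SDK call and retries failures with exponential backoff.
export async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Exponential backoff: 500 ms, 1 s, 2 s, ...
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Usage: `await withRetry(() => client.deployments.create({ ... }));`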
Integrations Management
Manage metrics and log export integrations programmatically.
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });
// Connect OTLP collector (metrics + logs in one call)
const { metrics, logs } = await client.integrations.connectOtlpCollector({
name: "Production Grafana",
endpoint: "http://otel-collector:4318",
headers: { Authorization: "Basic abc123" },
});
// List integrations
const metricsIntegrations = await client.integrations.listMetricsIntegrations();
const logIntegrations = await client.integrations.listLogIntegrations();
// Create individual integrations
await client.integrations.createMetricsIntegration({
name: "Prod Prometheus",
integration_type: "prometheus",
config: { prefix: "octomil", scrape_interval: 30 },
});
// Test and delete an existing integration by its id
// (e.g. taken from the entries returned by listMetricsIntegrations())
await client.integrations.testMetricsIntegration(integrationId);
await client.integrations.deleteMetricsIntegration(integrationId);
Inference
Pull a model from the registry, cache it locally, and run inference:
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({
apiKey: "edg_...",
orgId: "your-org-id",
// serverUrl: "https://api.octomil.com", // optional, default
// cacheDir: "~/.octomil/models", // optional, default
});
// Pull, cache, and predict in one call
const output = await client.predict("sentiment-v1", { text: "Octomil is great" });
console.log(output.label); // e.g. "positive"
console.log(output.score); // e.g. 0.97
console.log(output.latencyMs); // inference time in ms
Pull and load separately
// Pull downloads and caches the model
const model = await client.pull("sentiment-v1", {
version: "2.0.0",
format: "onnx",
force: false, // skip re-download if cached
onProgress: (downloaded, total) => {
console.log(`${Math.round(downloaded / total * 100)}%`);
},
});
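The onProgress callback above prints a raw percentage. A small formatting helper (illustrative only, not part of the SDK) keeps that logic reusable and guards against an unknown total:

```typescript
// Format download progress as a percentage string (illustrative helper,
// not part of @octomil/sdk). Falls back to a byte count if the total
// size is unknown or zero.
export function formatProgress(downloaded: number, total: number): string {
  if (!total || total <= 0) return `${downloaded} bytes`;
  const pct = Math.min(100, Math.round((downloaded / total) * 100));
  return `${pct}%`;
}
```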
// Load with engine options
await model.load({
executionProvider: "cpu", // "cpu" | "cuda" | "tensorrt" | "coreml"
graphOptimizationLevel: "all",
intraOpNumThreads: 4,
});
// Run inference
const result = await model.predict({ text: "Hello world" });
Cache management
// List cached models
const cached = await client.listCached();
// [{ modelRef, filePath, cachedAt, sizeBytes }]
// Remove a cached model
await client.removeCache("sentiment-v1");
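Building on listCached(), a cache-usage summary can be computed from the returned entries. The helper below is an illustrative sketch that assumes only the sizeBytes and modelRef fields shown above:

```typescript
// Summarize total cache usage from listCached() entries (illustrative
// sketch; assumes only the modelRef and sizeBytes fields shown above).
interface CachedEntry {
  modelRef: string;
  sizeBytes: number;
}

export function cacheUsage(entries: CachedEntry[]): string {
  const totalBytes = entries.reduce((sum, e) => sum + e.sizeBytes, 0);
  const mb = totalBytes / (1024 * 1024);
  return `${entries.length} model(s), ${mb.toFixed(1)} MB`;
}
```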
Streaming Inference
Stream tokens from the Octomil cloud inference endpoint via Server-Sent Events. This calls POST /api/v1/inference/stream and yields StreamToken objects as they arrive.
import { OctomilClient } from "@octomil/sdk";
const client = new OctomilClient({ apiKey: "edg_...", orgId: "your-org-id" });
// Stream with chat messages
for await (const token of client.streamPredict(
"phi-4-mini",
[{ role: "user", content: "Write a haiku about edge AI." }],
{ temperature: 0.8, max_tokens: 64 },
)) {
process.stdout.write(token.token);
if (token.done) break;
}
String prompt
for await (const token of client.streamPredict("phi-4-mini", "Explain federated learning.")) {
process.stdout.write(token.token);
}
Standalone streamInference function
You can also use the streaming function directly without OctomilClient:
import { streamInference } from "@octomil/sdk";
const config = {
serverUrl: "https://api.octomil.com",
apiKey: "edg_...",
};
for await (const token of streamInference(config, "phi-4-mini", "Hello")) {
process.stdout.write(token.token);
}
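Both streaming APIs yield an async iterable of StreamToken, so a small collector can turn a stream into the full generated string. This sketch uses only the token and done fields; everything else about the type is as documented in the table that follows:

```typescript
// Collect an async iterable of StreamToken into the full generated text
// (illustrative helper; uses only the token and done fields).
interface StreamToken {
  token: string;
  done: boolean;
}

export async function collectStream(
  stream: AsyncIterable<StreamToken>,
): Promise<string> {
  let text = "";
  for await (const t of stream) {
    text += t.token;
    if (t.done) break; // stop on the final token
  }
  return text;
}
```

Usage: `const text = await collectStream(client.streamPredict("phi-4-mini", "Hello"));`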
Each StreamToken contains:
| Field | Type | Description |
|---|---|---|
| token | string | The generated text fragment |
| done | boolean | true on the final token |
| provider | string? | Which backend served the request |
| latencyMs | number? | Server-side latency for this token |
| sessionId | string? | Unique session identifier |
Routing
The RoutingClient asks the Octomil API whether to run inference on-device or in the cloud, based on model size, device capabilities, and routing preference.
import { RoutingClient, detectDeviceCapabilities } from "@octomil/sdk";
const routing = new RoutingClient({
serverUrl: "https://api.octomil.com",
apiKey: "edg_...",
cacheTtlMs: 300_000, // cache routing decisions for 5 minutes (default)
prefer: "fastest", // "device" | "cloud" | "cheapest" | "fastest"
});
const capabilities = await detectDeviceCapabilities();
// { platform: "node", model: "Darwin arm64 ...", total_memory_mb: 16384,
// gpu_available: false, npu_available: false, supported_runtimes: ["onnxruntime-node"] }
const decision = await routing.route(
"phi-4-mini", // modelId
3_800_000_000, // modelParams
2048, // modelSizeMb
capabilities,
);
if (decision) {
console.log(decision.target); // "device" or "cloud"
console.log(decision.format); // e.g. "onnx"
console.log(decision.engine); // e.g. "onnxruntime-node"
console.log(decision.fallback_target); // cloud fallback endpoint, or null
}
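The actual decision is made server-side by the Octomil API, but the inputs above suggest the shape of the trade-off. A purely local sketch of such a heuristic (illustrative only; not the real routing logic, and the 2x memory headroom rule is an assumption) might look like:

```typescript
// Local sketch of a device-vs-cloud heuristic (illustrative only; the
// real decision is made by the Octomil API from the same inputs).
interface Capabilities {
  total_memory_mb: number;
  supported_runtimes: string[];
}

export function sketchRoute(
  modelSizeMb: number,
  caps: Capabilities,
  requiredRuntime: string,
): "device" | "cloud" {
  const runtimeOk = caps.supported_runtimes.includes(requiredRuntime);
  // Assumed rule of thumb: require ~2x the model size in RAM as headroom.
  const fitsInMemory = modelSizeMb * 2 <= caps.total_memory_mb;
  return runtimeOk && fitsInMemory ? "device" : "cloud";
}
```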
Cloud inference via routing
When routing decides to use the cloud, call cloudInfer():
if (decision?.target === "cloud") {
const result = await routing.cloudInfer(
"phi-4-mini",
{ prompt: "Hello, world!" },
{ temperature: 0.7, max_tokens: 128 },
);
console.log(result.output); // model response
console.log(result.latency_ms); // server-side latency
console.log(result.provider); // which cloud provider served it
}
Cache management
// Invalidate a specific model's cached routing decision
routing.invalidate("phi-4-mini");
// Clear all cached routing decisions
routing.clearCache();
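The cacheTtlMs option behaves like a standard TTL cache. A minimal sketch of the same behavior (illustrative, not the SDK's internal implementation; the injectable clock exists only to make it testable):

```typescript
// Minimal TTL cache mirroring the cacheTtlMs behavior described above
// (illustrative; not the SDK's internal implementation).
export class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private ttlMs: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() > entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  invalidate(key: string): void {
    this.entries.delete(key);
  }

  clear(): void {
    this.entries.clear();
  }
}
```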
Batch Inference
Run multiple predictions sequentially by iterating over inputs. The predict() method handles model caching internally, so the model is only downloaded and loaded once.
const inputs = [
{ text: "This product is amazing" },
{ text: "Terrible experience" },
{ text: "It was okay, nothing special" },
];
const results = [];
for (const input of inputs) {
const output = await client.predict("sentiment-v1", input);
results.push({ input: input.text, label: output.label, score: output.score });
}
console.table(results);
For higher throughput, pull and load the model once, then run predictions concurrently with Promise.all:
const model = await client.pull("sentiment-v1");
await model.load({ executionProvider: "cpu" });
const results = await Promise.all(
inputs.map(input => model.predict(input)),
);
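An unbounded Promise.all can queue every prediction at once against a single loaded model. A concurrency-limited mapper (an illustrative helper, not part of the SDK) caps in-flight work while preserving input order:

```typescript
// Map over inputs with at most `limit` calls in flight at once
// (illustrative helper; results keep input order).
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    // Each worker pulls the next unclaimed index until none remain.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

Usage: `const results = await mapWithConcurrency(inputs, 4, input => model.predict(input));`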
Requirements
- Node.js 18+
- TypeScript 5.0+ (optional but recommended)
Related
- Python SDK — server-side Python client
- Browser SDK — client-side browser inference
- Export Metrics — configure observability export
- CLI Reference — command-line interface