Browser SDK

Run Octomil models directly in the browser. No server required. No data leaves the device. The same model you deploy to iOS and Android runs in the browser with the same API -- one model registry, every platform.

Installation

npm

npm install @octomil/browser

CDN Script Tag

<script src="https://cdn.octomil.com/@octomil/browser/dist/octomil.min.js"></script>

Quick Start

Five lines from install to prediction:

import { Octomil } from '@octomil/browser';

const octomil = new Octomil({ apiKey: 'edg_...' });
const model = await octomil.load('sentiment-classifier');
const result = await model.predict({ text: 'Octomil is fantastic' });
console.log(result);
// { label: "positive", confidence: 0.94 }

Hardware Acceleration

The SDK automatically detects the best available backend:

Backend            | Detection                                     | Performance                         | Fallback
WebGPU             | Chrome 113+, Edge 113+, Firefox (behind flag) | Fastest. GPU-accelerated inference. | Automatic
WebAssembly (WASM) | All modern browsers                           | Good. CPU-based, SIMD-optimized.    | Default

You do not need to configure the backend. Octomil detects WebGPU support at initialization and uses it when available. On browsers without WebGPU, WASM inference activates automatically. The API is identical regardless of backend.

Check which backend is active:

const octomil = new Octomil({ apiKey: 'edg_...' });
console.log(octomil.backend);
// "webgpu" or "wasm"
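Under the hood, WebGPU detection presumably amounts to a feature check on navigator.gpu, which is only defined in browsers that ship WebGPU. A minimal sketch of the idea (not the SDK's actual detection code):

```typescript
// Feature-detect WebGPU: navigator.gpu is only defined where WebGPU ships.
// Illustrative sketch -- not the SDK's actual detection logic.
function pickBackend(): 'webgpu' | 'wasm' {
  const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
  return hasWebGPU ? 'webgpu' : 'wasm';
}
```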

Model Caching

Models are cached in the browser after the first download using IndexedDB. Subsequent page loads skip the network request entirely.

// First visit: downloads from Octomil CDN (~2-5 seconds depending on model size)
const model = await octomil.load('sentiment-classifier');

// Second visit: loaded from IndexedDB cache (~50-200ms)
const model = await octomil.load('sentiment-classifier');

Cache Management

// Check if a model is cached
const cached = await octomil.isCached('sentiment-classifier');
// true or false

// Clear a specific model from cache
await octomil.clearCache('sentiment-classifier');

// Clear all cached models
await octomil.clearCache();

Supported Model Types

Type           | Use Case                          | Example
Classification | Categorize text or images         | Sentiment analysis, content moderation
Embeddings     | Vector representations            | Semantic search, similarity matching
Sentiment      | Positive/negative/neutral scoring | Review analysis, feedback triage
Small LLMs     | Text generation, chat             | On-device assistants, summarization

Embeddings

Generate dense vector embeddings via the Octomil cloud endpoint. Useful for semantic search, clustering, and RAG pipelines.

import { embed } from '@octomil/browser';

const result = await embed(
  'https://api.octomil.com', // serverUrl
  'edg_...', // apiKey
  'nomic-embed-text', // modelId
  'What is federated learning?' // input (string or string[])
);

console.log(result.embeddings);
// [[0.12, -0.34, 0.56, ...]]

console.log(result.model);
// "nomic-embed-text"

console.log(result.usage);
// { promptTokens: 6, totalTokens: 6 }

Batch multiple inputs in a single call:

const result = await embed(
  'https://api.octomil.com',
  'edg_...',
  'nomic-embed-text',
  ['query: federated learning', 'query: edge inference', 'query: on-device AI']
);

console.log(result.embeddings.length); // 3
console.log(result.embeddings[0].length); // vector dimension (e.g. 768)
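The SDK returns raw vectors; comparing them is up to you. Cosine similarity is the usual choice for semantic search and similarity matching. A small helper (illustrative, not part of the SDK):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
// 1 means same direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0, 0], [1, 0, 0]); // 1 (identical)
cosineSimilarity([1, 0, 0], [0, 1, 0]); // 0 (orthogonal)
```

To rank documents against a query, score each document vector from result.embeddings against the query vector and sort descending.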

The embed() function accepts an optional AbortSignal as the fifth argument for cancellation:

const controller = new AbortController();
const result = await embed(
  'https://api.octomil.com',
  'edg_...',
  'nomic-embed-text',
  'cancel me',
  controller.signal
);

Smart Routing

The RoutingClient calls the Octomil routing API to decide whether inference should run on-device or in the cloud. It caches decisions with a configurable TTL and provides a cloud inference proxy when the server picks "cloud".

import { RoutingClient, detectDeviceCapabilities } from '@octomil/browser';

const router = new RoutingClient({
  serverUrl: 'https://api.octomil.com',
  apiKey: 'edg_...',
  prefer: 'fastest', // "device" | "cloud" | "cheapest" | "fastest" (default: "fastest")
  cacheTtlMs: 300_000, // Cache TTL in ms (default: 5 minutes)
});

// Detect browser capabilities (WebGPU, memory, etc.)
const capabilities = await detectDeviceCapabilities();
// { platform: "web", model: "<user agent>", total_memory_mb: 8192,
// gpu_available: true, npu_available: false, supported_runtimes: ["wasm", "webgpu"] }

// Ask the server: device or cloud?
const decision = await router.route(
  'sentiment-v1', // modelId
  125_000_000, // modelParams (number of parameters)
  250, // modelSizeMb
  capabilities // DeviceCapabilities
);

if (decision) {
  console.log(decision.target); // "device" or "cloud"
  console.log(decision.format); // e.g. "onnx"
  console.log(decision.engine); // e.g. "webgpu"
  console.log(decision.fallback_target); // null or { endpoint: "..." }
}

If the routing API returns "cloud", use cloudInfer() to run inference server-side:

if (decision?.target === 'cloud') {
  const result = await router.cloudInfer(
    'sentiment-v1', // modelId
    { text: 'Great product' }, // inputData
    { temperature: 0.7 } // parameters (optional)
  );
  console.log(result.output); // inference result
  console.log(result.latency_ms); // server-side latency
  console.log(result.provider); // e.g. "octomil-cloud"
}

Cache management:

// Clear all cached routing decisions
router.clearCache();

// Invalidate the decision for a specific model
router.invalidate('sentiment-v1');

Cloud Fallback

On network failure or non-200 responses from the routing API, route() returns null instead of throwing. This lets you fall back to local on-device inference gracefully:

const decision = await router.route(modelId, params, sizeMb, capabilities);

if (decision === null || decision.target === 'device') {
  // Run on-device inference
  const model = await octomil.load(modelId);
  const result = await model.predict({ text: input });
} else {
  // Cloud inference
  try {
    const result = await router.cloudInfer(modelId, { text: input });
  } catch (err) {
    // Cloud failed — fall back to device
    const model = await octomil.load(modelId);
    const result = await model.predict({ text: input });
  }
}

The fallback chain is: routing API -> cloud inference -> on-device inference. At every step, failures degrade gracefully rather than breaking the user experience. The cloudInfer() method throws on failure (unlike route() which returns null), so wrap it in a try/catch for a full fallback chain.

API Reference

Octomil (Constructor)

const octomil = new Octomil({
  apiKey: 'edg_...', // Your Octomil API key
  orgId: 'your-org-id', // Organization ID (optional, resolved from key)
  apiBase: 'https://api.octomil.com', // API base URL (optional)
  telemetry: false, // Opt-in telemetry reporting (default: false)
});

load(modelId, options?)

Download and initialize a model. Returns a Model instance ready for inference.

const model = await octomil.load('sentiment-classifier', {
  version: '2.0.0', // Optional: specific version (default: latest)
});

The model is cached automatically after download.

predict(input)

Run inference on a loaded model. Input and output shapes depend on the model type.

// Text classification
const result = await model.predict({ text: 'Great product, fast shipping' });
// { label: "positive", confidence: 0.92 }

// Image classification
const result = await model.predict({ image: canvasElement });
// { labels: [{ name: "cat", confidence: 0.89 }, { name: "dog", confidence: 0.08 }] }

// Embeddings
const result = await model.predict({ text: 'federated learning' });
// { embedding: Float32Array([0.12, -0.34, 0.56, ...]) }

chat(messages, options?)

Send a chat completion request to the server. Requires serverUrl to be configured.

const octomil = new Octomil({
  model: 'https://models.octomil.com/llm.onnx',
  serverUrl: 'https://api.octomil.com',
  apiKey: 'edg_...',
});
await octomil.load();

const response = await octomil.chat([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is federated learning?' },
]);

console.log(response.message.content);

chatStream(messages, options?)

Streaming variant — yields chunks as they arrive from the server via SSE:

const stream = octomil.chatStream(
  [{ role: 'user', content: 'Explain edge computing.' }],
  { temperature: 0.7 }
);

let text = '';
for await (const chunk of stream) {
  text += chunk.content; // append tokens as they stream in
}

isCached(modelId)

Check if a model is already cached locally.

const cached = await octomil.isCached('sentiment-classifier');

predictBatch(inputs)

Run inference on multiple inputs sequentially. Returns an array of results.

const results = await octomil.predictBatch([
  { text: 'Great product' },
  { text: 'Terrible experience' },
  { text: 'It was okay' },
]);
// results[0].label === "positive"
// results[1].label === "negative"

clearCache(modelId?)

Clear cached models. Pass a model ID to clear a specific model, or call with no arguments to clear all.

await octomil.clearCache('sentiment-classifier'); // Clear one
await octomil.clearCache(); // Clear all

Device Authentication

Register browser devices for authenticated model downloads and training round participation.

import { DeviceAuthManager } from '@octomil/browser';

const auth = new DeviceAuthManager({
  serverUrl: 'https://api.octomil.com',
  apiKey: 'edg_...',
});

await auth.bootstrap('your-org-id');
const token = await auth.getToken();
// Use token for authenticated API calls

The manager generates a stable device ID by hashing browser fingerprint data (user agent, screen size, timezone, language) via SHA-256. Tokens auto-refresh 30 seconds before expiry.

Model Integrity Verification

Verify downloaded models haven't been tampered with:

import { verifyModelIntegrity, assertModelIntegrity } from '@octomil/browser';

const modelData = await fetch('model.onnx').then(r => r.arrayBuffer());

// Returns true/false
const valid = await verifyModelIntegrity(modelData, 'expected-sha256-hash');

// Throws if hash doesn't match
await assertModelIntegrity(modelData, 'expected-sha256-hash');

Federated Learning

Participate in server-coordinated federated training rounds from the browser.

import { FederatedClient, WeightExtractor } from '@octomil/browser';

const client = new FederatedClient({
  serverUrl: 'https://api.octomil.com',
  deviceId: 'device-abc',
});

// Join a training round
const round = await client.getTrainingRound('sentiment-v1');
await client.joinRound(round.roundId);

// Train locally (you provide the training logic)
const result = await client.train(round, {
  onTrainStep: async (weights) => {
    // Your training logic here — return updated weights
    return updatedWeights;
  },
});

// Submit weight updates
await client.submitUpdate(round.roundId, result.delta);

Weight Utilities

import { WeightExtractor } from '@octomil/browser';

// Compute difference between model snapshots
const delta = WeightExtractor.computeDelta(beforeWeights, afterWeights);

// Apply a delta to weights
const updated = WeightExtractor.applyDelta(weights, delta);

// Compute L2 norm (useful for clipping)
const norm = WeightExtractor.l2Norm(delta);
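The semantics of these helpers can be sketched as plain element-wise array math (an illustrative reimplementation, not the SDK's code):

```typescript
// delta[i] = after[i] - before[i]
function computeDelta(before: Float32Array, after: Float32Array): Float32Array {
  return after.map((w, i) => w - before[i]);
}

// updated[i] = weights[i] + delta[i]
function applyDelta(weights: Float32Array, delta: Float32Array): Float32Array {
  return weights.map((w, i) => w + delta[i]);
}

// Euclidean length: sqrt(sum of squares)
function l2Norm(v: Float32Array): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}
```

Note that applyDelta(before, computeDelta(before, after)) recovers after — the round trip federated clients rely on when submitting deltas instead of full weights.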

Secure Aggregation

Mask weight updates so the server never sees raw gradients:

import { SecureAggregation, SecAggPlus } from '@octomil/browser';

// ECDH key exchange between participants
const alice = new SecureAggregation();
const bob = new SecureAggregation();

const aliceKey = await alice.generateKeyPair();
const bobKey = await bob.generateKeyPair();

// Derive shared secret
const secret = await alice.deriveSharedSecret(bobKey.publicKey);

// SecAgg+ with threshold secret sharing
const secagg = new SecAggPlus(3); // threshold of 3
const shares = secagg.splitSecret(42, 5); // split into 5 shares
const reconstructed = secagg.reconstructSecret(shares.slice(0, 3));
// reconstructed === 42
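The threshold scheme behind splitSecret() and reconstructSecret() is Shamir secret sharing: hide the secret as the constant term of a random degree t-1 polynomial over a prime field, hand out evaluations at distinct points as shares, and recover the constant from any t shares by Lagrange interpolation at x = 0. A self-contained sketch (the field size and share encoding here are illustrative choices, not the SDK's):

```typescript
// Shamir secret sharing over the prime field GF(2^61 - 1).
// Illustrative sketch; the SDK's field and encoding will differ.
const P = 2305843009213693951n; // 2^61 - 1 (a Mersenne prime)

const mod = (a: bigint): bigint => ((a % P) + P) % P;

// Square-and-multiply modular exponentiation: a^e mod P
function modPow(a: bigint, e: bigint): bigint {
  let result = 1n;
  let base = mod(a);
  while (e > 0n) {
    if (e & 1n) result = mod(result * base);
    base = mod(base * base);
    e >>= 1n;
  }
  return result;
}

// Modular inverse via Fermat's little theorem (P is prime)
const modInv = (a: bigint): bigint => modPow(a, P - 2n);

// Split `secret` into n shares; any t of them reconstruct it.
function splitSecret(secret: bigint, n: number, t: number): Array<[bigint, bigint]> {
  // f(x) = secret + c1*x + ... + c(t-1)*x^(t-1), random coefficients
  const coeffs: bigint[] = [mod(secret)];
  for (let i = 1; i < t; i++) {
    coeffs.push(mod(BigInt(Math.floor(Math.random() * 2 ** 52))));
  }
  const shares: Array<[bigint, bigint]> = [];
  for (let k = 1; k <= n; k++) {
    const x = BigInt(k);
    let y = 0n;
    let xp = 1n; // x^j
    for (const c of coeffs) {
      y = mod(y + c * xp);
      xp = mod(xp * x);
    }
    shares.push([x, y]);
  }
  return shares;
}

// Lagrange interpolation at x = 0 recovers f(0) = secret
function reconstructSecret(shares: Array<[bigint, bigint]>): bigint {
  let secret = 0n;
  for (const [xi, yi] of shares) {
    let num = 1n;
    let den = 1n;
    for (const [xj] of shares) {
      if (xj === xi) continue;
      num = mod(num * mod(-xj));   // (0 - xj)
      den = mod(den * mod(xi - xj));
    }
    secret = mod(secret + yi * num * modInv(den));
  }
  return secret;
}
```

With splitSecret(42n, 5, 3), any 3 of the 5 shares reconstruct 42 — mirroring the SecAggPlus example above.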

Differential Privacy

Apply privacy-preserving transformations to weight updates before submission:

import { clipGradients, addGaussianNoise, quantize, dequantize } from '@octomil/browser';

// Clip gradient norms
const clipped = clipGradients(delta, 1.0);

// Add calibrated Gaussian noise ((epsilon, delta)-DP)
const noisy = addGaussianNoise(clipped, 1.0, 1.0, 1e-5);

// Quantize for bandwidth reduction
const quantized = quantize(delta, 8); // 8-bit quantization
const restored = dequantize(quantized);
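The clipping step follows directly from its definition: if the update's L2 norm exceeds the threshold, scale every element so the norm equals the threshold; otherwise leave it alone. An illustrative sketch (not the SDK's implementation):

```typescript
// Clip a weight delta to a maximum L2 norm.
// scale = min(1, maxNorm / |delta|), applied element-wise.
function clipToNorm(delta: Float32Array, maxNorm: number): Float32Array {
  const norm = Math.sqrt(delta.reduce((sum, x) => sum + x * x, 0));
  const scale = norm > maxNorm ? maxNorm / norm : 1;
  return delta.map((x) => x * scale);
}

clipToNorm(new Float32Array([3, 4]), 1.0); // scales [3, 4] (norm 5) down to norm 1
```

Clipping before adding noise is what bounds each client's sensitivity, which the Gaussian mechanism needs to give its (epsilon, delta) guarantee.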

Canary Rollouts

Resolve which model version a device should use based on rollout configuration:

import { RolloutsManager } from '@octomil/browser';

const rollouts = new RolloutsManager({ serverUrl: 'https://api.octomil.com' });

// Deterministic assignment — same device always gets same version
const version = await rollouts.resolveVersion('sentiment-v1', 'device-abc');
// "1.0.0" (stable) or "2.0.0" (canary)

// Check canary eligibility
const inCanary = rollouts.isInCanaryGroup('sentiment-v1', 'device-abc', 10);
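Deterministic assignment is typically implemented by hashing the model and device IDs into a bucket in [0, 100) and comparing it to the canary percentage — the same inputs always land in the same bucket. A sketch using FNV-1a (an assumed hash; the SDK's actual scheme may differ):

```typescript
// FNV-1a 32-bit hash of "modelId:deviceId", reduced to a bucket in [0, 100)
function canaryBucket(modelId: string, deviceId: string): number {
  const key = `${modelId}:${deviceId}`;
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % 100;
}

// In the canary group when the bucket falls under the rollout percentage
function inCanaryGroup(modelId: string, deviceId: string, percent: number): boolean {
  return canaryBucket(modelId, deviceId) < percent;
}
```

Raising the rollout percentage only ever adds devices to the canary group; devices already assigned never flip back, which keeps user experience stable mid-rollout.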

A/B Experiments

Run experiments with deterministic variant assignment across browser clients:

import { ExperimentsClient } from '@octomil/browser';

const experiments = new ExperimentsClient({ serverUrl: 'https://api.octomil.com' });

// Get active experiments
const active = await experiments.getActiveExperiments();

// Deterministic variant assignment
const variant = experiments.getVariant(active[0], 'device-abc');
console.log(variant.name); // "control" or "treatment"

// Resolve which model version to use for an experiment
const result = await experiments.resolveModelExperiment('sentiment-v1', 'device-abc');
if (result) {
  console.log(result.variant.modelVersion); // version assigned to this device
}

// Track experiment metrics
await experiments.trackMetric(active[0].id, variant.id, 'accuracy', 0.95, 'device-abc');

Telemetry

Telemetry is opt-in. When enabled, the SDK reports inference latency and model usage to your Octomil dashboard. No user data, prompt content, or prediction results are transmitted.

const octomil = new Octomil({
  apiKey: 'edg_...',
  telemetry: true,
});

Telemetry events appear in the Monitoring Dashboard alongside events from octomil serve, iOS, and Android SDKs. All platforms report to the same dashboard.

Cross-Platform Model Registry

The Browser SDK uses the same model registry as the server, iOS SDK, and Android SDK. A model uploaded once is available everywhere:

Upload model via dashboard or CLI
|
+--> octomil serve (desktop)
+--> iOS SDK (CoreML)
+--> Android SDK (TFLite)
+--> Browser SDK (WebGPU/WASM) <-- same model, same version

Model versioning, rollouts, and experiments apply uniformly. When you roll out version 2.0 to 10% of your fleet, that includes browser clients.

Full HTML Example

<!DOCTYPE html>
<html>
<head>
  <title>Octomil Browser Demo</title>
  <script src="https://cdn.octomil.com/@octomil/browser/dist/octomil.min.js"></script>
</head>
<body>
  <textarea id="input" placeholder="Type text to analyze..."></textarea>
  <button onclick="analyze()">Analyze Sentiment</button>
  <pre id="output"></pre>

  <script>
    const octomil = new Octomil({ apiKey: 'edg_...' });
    let model;

    // Pre-load model on page load
    (async () => {
      model = await octomil.load('sentiment-classifier');
      document.getElementById('output').textContent = 'Model ready.';
    })();

    async function analyze() {
      const text = document.getElementById('input').value;
      const result = await model.predict({ text });
      document.getElementById('output').textContent = JSON.stringify(result, null, 2);
    }
  </script>
</body>
</html>

Requirements

  • Any modern browser (Chrome, Firefox, Safari, Edge)
  • WebGPU for GPU acceleration (Chrome 113+, Edge 113+)
  • WASM support (all browsers released after 2017)

Gotchas

  • WebGPU is Chrome/Edge only (for now) — Firefox has WebGPU behind a flag. Safari has no support yet. WASM fallback works everywhere but is slower.
  • IndexedDB cache has browser limits — most browsers cap IndexedDB at 50-80% of available disk. Large models (500MB+) may not cache on low-storage devices.
  • First load downloads the model — there's no way to avoid the initial download. Show a loading indicator. Subsequent loads are instant from cache.
  • CORS headers required for self-hosted models — if you serve models from your own CDN instead of Octomil's, configure CORS headers or the browser will block the download.
  • Telemetry is opt-in — unlike the server SDKs, browser telemetry defaults to false. Set telemetry: true explicitly to see browser metrics in the dashboard.
  • chat() and chatStream() require serverUrl — these methods stream from the server via SSE. For pure on-device inference without a server, use predict().