# OpenAI Migration Guide
Octomil exposes an OpenAI-compatible chat completions and embeddings surface. Most applications can switch by changing the base URL and API key -- no other code changes required.
## What changes
|  | OpenAI | Octomil |
|---|---|---|
| Base URL | https://api.openai.com/v1 | http://localhost:8080/v1 (local via octomil serve) or https://api.octomil.com/v1 (hosted) |
| API key | sk-... | Not needed (local) or OCTOMIL_SERVER_KEY (hosted) |
| Models | gpt-4o, gpt-3.5-turbo | phi-4-mini, gemma-3-4b, llama-3b, and more |
| Inference | Cloud only | On-device, cloud fallback via routing |
| Pricing | Per-token | Free on-device, per-token for cloud fallback |
## What stays the same
- Request and response format (`messages`, `temperature`, `max_tokens`, `stream`)
- Streaming via SSE (`stream: true`); see the sketch after this list
- JSON mode and structured output (`response_format`)
- Tool/function calling format
- Error response structure
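Because streaming uses the same SSE chunk format, existing streaming loops keep working unchanged. Here is a minimal sketch using the official `openai` Python SDK against a local `octomil serve` instance (the model name and URL match the quick-swap examples below):

```python
from openai import OpenAI

# Local octomil serve instance; no real API key is needed locally.
client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# stream=True yields the same chunk objects the OpenAI API streams over SSE.
stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```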
## Quick swap

**Python (openai SDK)**
```python
# Before
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# After
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/v1",  # octomil serve
)

# Same code from here on
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
**Node.js (openai SDK)**

```javascript
// Before
import OpenAI from "openai";

const client = new OpenAI({ apiKey: "sk-..." });

// After
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "not-needed",
  baseURL: "http://localhost:8080/v1",
});

// Same code from here on
const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
```
**cURL**

```bash
# Before
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer sk-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hello"}]}'

# After
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer not-needed" \
  -H "Content-Type: application/json" \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"Hello"}]}'
```
## Start the local server
Before making requests, start `octomil serve` with your model:

```bash
octomil serve phi-4-mini
```

The server listens on `http://localhost:8080` and exposes `/v1/chat/completions`, `/v1/models`, and `/v1/embeddings`.
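To confirm the swap end to end, the same SDK client can exercise the other two endpoints. A minimal sketch, assuming `octomil serve phi-4-mini` is running; whether a given model serves embeddings depends on the model, so swap in an embedding-capable model from the catalog if needed:

```python
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# /v1/models -- lists whatever the local server is currently serving.
for model in client.models.list():
    print(model.id)

# /v1/embeddings -- same request shape as OpenAI's embeddings endpoint.
# Assumes the served model can produce embeddings; adjust the name if not.
embedding = client.embeddings.create(
    model="phi-4-mini",
    input="Octomil speaks the OpenAI wire format.",
)
print(len(embedding.data[0].embedding), "dimensions")
```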
## Hosted API
To use Octomil's hosted inference instead of running locally, point at `https://api.octomil.com/v1` and set your `OCTOMIL_SERVER_KEY`:

```bash
export OCTOMIL_SERVER_KEY="YOUR_SERVER_KEY"

curl https://api.octomil.com/v1/chat/completions \
  -H "Authorization: Bearer $OCTOMIL_SERVER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"Hello"}]}'
```
## Feature mapping
| OpenAI feature | Octomil support | Notes |
|---|---|---|
| Chat completions | Yes | Full compatibility |
| Streaming | Yes | SSE format, same event structure |
| JSON mode | Yes | `response_format: { type: "json_object" }` |
| Structured output | Yes | JSON Schema via `response_format` -- see Control |
| Tool calling | Yes | Same `tools` parameter format -- see Tool calling |
| Embeddings | Yes | `/v1/embeddings` endpoint |
| Vision (images) | Partial | Supported on models with vision capability |
| Audio (Whisper) | Yes | Via whisper.cpp engine |
| Assistants API | No | Use Workflows instead |
| Fine-tuning API | No | Use federated training instead |
| Batch API | No | -- |
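Because the request format is unchanged, features such as JSON mode carry over directly. A minimal sketch of `response_format` against the local server, assuming the served model follows JSON instructions reliably:

```python
import json

from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# Same response_format parameter as OpenAI's JSON mode.
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": "Name three local-first databases."},
    ],
    response_format={"type": "json_object"},
)

print(json.loads(response.choices[0].message.content))
```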
## Choosing a model
Map your OpenAI models to Octomil equivalents based on your quality/speed requirements:
| OpenAI model | Octomil equivalent | Trade-off |
|---|---|---|
| `gpt-4o` | `phi-4-mini` + cloud fallback | Most queries on-device, hard ones route to cloud |
| `gpt-4o-mini` | `gemma3-1b` or `qwen-1.5b` | Fully on-device, fast |
| `gpt-3.5-turbo` | `smollm-360m` | Smallest, fastest, constrained devices |
Use routing to automatically escalate queries that exceed a small model's capability.
## Gotchas
- Examples are available -- for working LangChain, RAG, or Vercel AI SDK examples, see OpenAI-compatible integrations.
- Model names differ -- there is no `gpt-4o` in Octomil. Use model names from the catalog.
- Local server must be running -- `octomil serve` must be active for local inference; a quick health-check sketch follows this list. For fleet deployments, use `octomil deploy` instead.
- Token limits are model-dependent -- on-device models have smaller context windows than cloud models. Check the model card for limits.
- Rate limits don't apply locally -- on-device inference has no rate limits. Cloud fallback follows your org's plan limits.
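A quick way to verify the local server is up before sending traffic is to probe `/v1/models`. This sketch uses only the Python standard library rather than any Octomil-provided helper, and assumes the endpoint returns the usual OpenAI-style `{"data": [...]}` list:

```python
import json
import urllib.error
import urllib.request

# Probe the local octomil serve instance; /v1/models answers without a key.
try:
    with urllib.request.urlopen("http://localhost:8080/v1/models", timeout=2) as resp:
        models = json.load(resp)
    print("Server up, serving:", [m["id"] for m in models.get("data", [])])
except (urllib.error.URLError, OSError) as exc:
    print("Local server not reachable -- did you run `octomil serve`?", exc)
```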