OpenAI Migration Guide

Octomil exposes an OpenAI-compatible API surface for chat completions and embeddings. Most applications can switch by changing only the base URL and API key -- no other code changes required.

What changes

|  | OpenAI | Octomil |
| --- | --- | --- |
| Base URL | `https://api.openai.com/v1` | `http://localhost:8080/v1` (local via `octomil serve`) or `https://api.octomil.com/v1` (hosted) |
| API key | `sk-...` | Not needed (local) or `OCTOMIL_SERVER_KEY` (hosted) |
| Models | `gpt-4o`, `gpt-3.5-turbo` | `phi-4-mini`, `gemma-3-4b`, `llama-3b`, and more |
| Inference | Cloud only | On-device, with cloud fallback via routing |
| Pricing | Per-token | Free on-device, per-token for cloud fallback |

What stays the same

  • Request and response format (messages, temperature, max_tokens, stream)
  • Streaming via SSE (stream: true)
  • JSON mode and structured output (response_format)
  • Tool/function calling format
  • Error response structure

Quick swap

```python
# Before
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After
from openai import OpenAI
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/v1",  # octomil serve
)

# Same code from here on
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
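
Streaming carries over unchanged as well. A minimal sketch, assuming the same local `octomil serve` endpoint and `phi-4-mini` model as above:

```python
# Streaming uses the same OpenAI client API: pass stream=True and
# iterate over the SSE-backed chunk stream.
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Write a haiku about local inference"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content may be None on role/stop chunks.
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```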

Start the local server

Before making requests, start `octomil serve` with your model:

```bash
octomil serve phi-4-mini
```

The server listens on `http://localhost:8080` and exposes `/v1/chat/completions`, `/v1/models`, and `/v1/embeddings`.
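
To sanity-check that the server is up, you can hit the models endpoint (a quick check, assuming the default port above):

```bash
# List the models the local server is currently serving
curl http://localhost:8080/v1/models
```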

Hosted API

To use Octomil's hosted inference instead of running locally, point at `https://api.octomil.com/v1` and set your `OCTOMIL_SERVER_KEY`:

```bash
export OCTOMIL_SERVER_KEY="YOUR_SERVER_KEY"

curl https://api.octomil.com/v1/chat/completions \
  -H "Authorization: Bearer $OCTOMIL_SERVER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"Hello"}]}'
```

Feature mapping

| OpenAI feature | Octomil support | Notes |
| --- | --- | --- |
| Chat completions | Yes | Full compatibility |
| Streaming | Yes | SSE format, same event structure |
| JSON mode | Yes | `response_format: { type: "json_object" }` |
| Structured output | Yes | JSON Schema via `response_format` -- see Control |
| Tool calling | Yes | Same `tools` parameter format -- see Tool calling |
| Embeddings | Yes | `/v1/embeddings` endpoint |
| Vision (images) | Partial | Supported on models with vision capability |
| Audio (Whisper) | Yes | Via whisper.cpp engine |
| Assistants API | No | Use Workflows instead |
| Fine-tuning API | No | Use federated training instead |
| Batch API | No | -- |
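
Because JSON mode and tool calling keep the OpenAI parameter shapes, existing request code carries over directly. A hedged sketch of both against the local server (the model must actually support the feature -- check its model card; the `get_weather` tool is a hypothetical example):

```python
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# JSON mode: same response_format parameter as OpenAI
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Return a JSON object with keys 'city' and 'country' for Paris."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)

# Tool calling: same tools parameter format as OpenAI
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```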

Choosing a model

Map your OpenAI models to Octomil equivalents based on your quality/speed requirements:

| OpenAI model | Octomil equivalent | Trade-off |
| --- | --- | --- |
| `gpt-4o` | `phi-4-mini` + cloud fallback | Most queries on-device, hard ones route to cloud |
| `gpt-4o-mini` | `gemma3-1b` or `qwen-1.5b` | Fully on-device, fast |
| `gpt-3.5-turbo` | `smollm-360m` | Smallest, fastest, constrained devices |

Use routing to automatically escalate queries that exceed a small model's capability.
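
One low-risk way to stage the migration is a small mapping in your own code, so call sites keep passing OpenAI model names while requests go to Octomil equivalents. A sketch -- the mapping below just mirrors the table above; adjust it to your catalog:

```python
# Map OpenAI model names to Octomil equivalents (from the table above)
MODEL_MAP = {
    "gpt-4o": "phi-4-mini",       # pair with cloud-fallback routing for hard queries
    "gpt-4o-mini": "gemma3-1b",
    "gpt-3.5-turbo": "smollm-360m",
}

def to_octomil_model(openai_model: str) -> str:
    """Translate an OpenAI model name; fail loudly on unmapped models."""
    try:
        return MODEL_MAP[openai_model]
    except KeyError:
        raise ValueError(f"No Octomil mapping for model {openai_model!r}")

# Call sites stay unchanged except for the wrapper:
# client.chat.completions.create(model=to_octomil_model("gpt-4o"), ...)
```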

Gotchas

  • Examples are available -- for working LangChain, RAG, and Vercel AI SDK examples, see OpenAI-compatible integrations.
  • Model names differ -- there is no `gpt-4o` in Octomil. Use model names from the catalog.
  • Local server must be running -- `octomil serve` must be active for local inference. For fleet deployments, use `octomil deploy` instead.
  • Token limits are model-dependent -- on-device models have smaller context windows than cloud models. Check the model card for limits.
  • Rate limits don't apply locally -- on-device inference has no rate limits. Cloud fallback follows your org's plan limits.