# OpenAI Migration Guide
Octomil exposes an OpenAI-compatible chat completions and embeddings surface. Most applications can switch by changing the base URL and API key -- no other code changes required.
## What changes
|  | OpenAI | Octomil |
|---|---|---|
| Base URL | https://api.openai.com/v1 | http://localhost:8080/v1 (local via octomil serve) or https://api.octomil.com/v1 (hosted) |
| API key | sk-... | Not needed (local) or OCTOMIL_SERVER_KEY (hosted) |
| Models | gpt-4o, gpt-3.5-turbo | phi-4-mini, gemma-3-4b, llama-3b, and more |
| Inference | Cloud only | On-device, cloud fallback via routing |
| Pricing | Per-token | Free on-device, per-token for cloud fallback |
## What stays the same
- Request and response format (`messages`, `temperature`, `max_tokens`, `stream`)
- Streaming via SSE (`stream: true`); see the sketch after this list
- JSON mode and structured output (`response_format`)
- Tool/function calling format
- Error response structure
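Because streaming uses the same SSE chunk format, existing streaming loops keep working unchanged. Here is a minimal sketch using the official `openai` Python SDK against a local `octomil serve` instance (the model name and URL match the quick-swap examples below):

```python
from openai import OpenAI

# Local octomil serve instance; no real API key is needed locally.
client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# stream=True yields the same chunk objects the OpenAI API streams over SSE.
stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```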
## Quick swap

**Python (openai SDK)**
```python
# Before
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# After
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/v1",  # octomil serve
)

# Same code from here on
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
**Node.js (openai SDK)**

```javascript
// Before
import OpenAI from "openai";

const client = new OpenAI({ apiKey: "sk-..." });

// After
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "not-needed",
  baseURL: "http://localhost:8080/v1",
});

// Same code from here on
const response = await client.chat.completions.create({
  model: "phi-4-mini",
  messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
```
**cURL**

```bash
# Before
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer sk-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hello"}]}'

# After
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer not-needed" \
  -H "Content-Type: application/json" \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"Hello"}]}'
```
## Start the local server
Before making requests, start `octomil serve` with your model:

```bash
octomil serve phi-4-mini
```

The server listens on `http://localhost:8080` and exposes `/v1/chat/completions`, `/v1/models`, and `/v1/embeddings`.
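To confirm the swap end to end, the same SDK client can exercise the other two endpoints. A minimal sketch, assuming `octomil serve phi-4-mini` is running; whether a given model serves embeddings depends on the model, so swap in an embedding-capable model from the catalog if needed:

```python
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# /v1/models -- lists whatever the local server is currently serving.
for model in client.models.list():
    print(model.id)

# /v1/embeddings -- same request shape as OpenAI's embeddings endpoint.
# Assumes the served model can produce embeddings; adjust the name if not.
embedding = client.embeddings.create(
    model="phi-4-mini",
    input="Octomil speaks the OpenAI wire format.",
)
print(len(embedding.data[0].embedding), "dimensions")
```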
## Hosted API
To use Octomil's hosted inference instead of running locally, point at `https://api.octomil.com/v1` and set your `OCTOMIL_SERVER_KEY`:

```bash
export OCTOMIL_SERVER_KEY="YOUR_SERVER_KEY"

curl https://api.octomil.com/v1/chat/completions \
  -H "Authorization: Bearer $OCTOMIL_SERVER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"Hello"}]}'
```
## Feature mapping
| OpenAI feature | Octomil support | Notes |
|---|---|---|
| Chat completions | Yes | Full compatibility |
| Streaming | Yes | SSE format, same event structure |
| JSON mode | Yes | `response_format: { type: "json_object" }` |
| Structured output | Yes | JSON Schema via `response_format` -- see Control |
| Tool calling | Yes | Same `tools` parameter format -- see Tool calling |
| Embeddings | Yes | `/v1/embeddings` endpoint |
| Vision (images) | Partial | Supported on models with vision capability |
| Audio (Whisper) | Yes | Via whisper.cpp engine |
| Assistants API | No | Use Workflows instead |
| Fine-tuning API | No | Use federated training instead |
| Batch API | No | -- |
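Because the request format is unchanged, features such as JSON mode carry over directly. A minimal sketch of `response_format` against the local server, assuming the served model follows JSON instructions reliably:

```python
import json

from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# Same response_format parameter as OpenAI's JSON mode.
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": "Name three local-first databases."},
    ],
    response_format={"type": "json_object"},
)

print(json.loads(response.choices[0].message.content))
```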
## Choosing a model
Map your OpenAI models to Octomil equivalents based on your quality/speed requirements:
| OpenAI model | Octomil equivalent | Trade-off |
|---|---|---|
| `gpt-4o` | `phi-4-mini` + cloud fallback | Most queries on-device, hard ones route to cloud |
| `gpt-4o-mini` | `gemma3-1b` or `qwen-1.5b` | Fully on-device, fast |
| `gpt-3.5-turbo` | `smollm-360m` | Smallest, fastest, constrained devices |
Use routing to automatically escalate queries that exceed a small model's capability.
## Gotchas
- Examples are available -- for working LangChain, RAG, or Vercel AI SDK examples, see OpenAI-compatible integrations.
- Model names differ -- there is no `gpt-4o` in Octomil. Use model names from the catalog.
- Local server must be running -- `octomil serve` must be active for local inference; a quick health-check sketch follows this list. For fleet deployments, use `octomil deploy` instead.
- Token limits are model-dependent -- on-device models have smaller context windows than cloud models. Check the model card for limits.
- Rate limits don't apply locally -- on-device inference has no rate limits. Cloud fallback follows your org's plan limits.
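A quick way to verify the local server is up before sending traffic is to probe `/v1/models`. This sketch uses only the Python standard library rather than any Octomil-provided helper, and assumes the endpoint returns the usual OpenAI-style `{"data": [...]}` list:

```python
import json
import urllib.error
import urllib.request

# Probe the local octomil serve instance; /v1/models answers without a key.
try:
    with urllib.request.urlopen("http://localhost:8080/v1/models", timeout=2) as resp:
        models = json.load(resp)
    print("Server up, serving:", [m["id"] for m in models.get("data", [])])
except (urllib.error.URLError, OSError) as exc:
    print("Local server not reachable -- did you run `octomil serve`?", exc)
```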