# Supported Models

Octomil ships with a curated catalog of models that work out of the box. Run any model with:

```bash
octomil serve <model-name>
```

You can also use any HuggingFace model by passing the repo ID directly:

```bash
octomil serve mlx-community/gemma-3-4b-it-4bit
```
## Model catalog

| Model | Params | MLX (Apple Silicon) | llama.cpp (CPU/CUDA) | Best for |
|---|---|---|---|---|
| `smollm-360m` | 360M | 4-bit | Q4_K_M | Testing, prototyping, constrained devices |
| `gemma-1b` | 1B | 4-bit | Q4_K_M | Fast responses, low-memory devices |
| `qwen-1.5b` | 1.5B | 4-bit | Q4_K_M | Multilingual, code |
| `llama-1b` | 1B | 4-bit | Q4_K_M | General chat |
| `llama-3b` | 3B | 4-bit | Q4_K_M | Balanced speed/quality |
| `qwen-3b` | 3B | 4-bit | Q4_K_M | Multilingual, code, reasoning |
| `gemma-4b` | 4B | 4-bit | Q4_K_M | Strong general-purpose |
| `phi-mini` | 3.8B | 4-bit | Q4_K_M | Code, reasoning, compact |
| `mistral-7b` | 7B | 4-bit | Q4_K_M | High-quality general-purpose |
| `qwen-7b` | 7B | 4-bit | Q4_K_M | Multilingual, long context |
| `llama-8b` | 8B | 4-bit | Q4_K_M | Best open-source at this size |
| `phi-4` | 14B | 4-bit | - | Strong reasoning, needs 16GB+ RAM |
| `gemma-12b` | 12B | 4-bit | - | High quality, needs 16GB+ RAM |
| `gemma-27b` | 27B | 4-bit | - | Near-frontier quality, needs 32GB+ RAM |
## Engine selection
Octomil auto-selects the fastest engine for your hardware:
| Engine | Platform | Acceleration | Priority |
|---|---|---|---|
| MLX | Apple Silicon Mac | Metal GPU (unified memory) | 1st |
| MNN-LLM | All platforms | CPU + GPU | 2nd |
| llama.cpp | All platforms | Metal / CUDA / CPU | 3rd |
| ExecuTorch | All platforms | CoreML / QNN / XNNPACK | 4th |
Override with `--engine`:

```bash
octomil serve phi-mini --engine llama.cpp
```
## Device deployment formats

When deploying to mobile devices with `octomil deploy --phone`, the server automatically selects the optimal format:
| Device tier | Example devices | Format | Quantization | Max model size |
|---|---|---|---|---|
| Flagship | iPhone 15 Pro+, Galaxy S24+, Pixel 9 | CoreML / TFLite float16 | float16 | 2 GB |
| High | iPhone 14 Pro, Galaxy S22, Pixel 7 | CoreML / TFLite | float16-int8 | 1 GB |
| Mid | iPhone 12, Galaxy A54 | CoreML GPU / TFLite | int8 | 500 MB |
| Low | Older Android (under 4GB RAM) | TFLite int8 | int8 | 200 MB |
See Deploy Compatibility for details on device tier classification and format resolution.
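The tier table above is effectively a lookup from device tier to packaging constraints. A hedged sketch of that mapping — the dictionary layout, field names, and `fits_on_tier` helper are my assumptions for illustration, not Octomil internals:

```python
# Hypothetical encoding of the device-tier table above.
# Sizes are in MB (2 GB = 2048 MB, 1 GB = 1024 MB).
DEVICE_TIERS = {
    "flagship": {"format": "CoreML / TFLite float16", "quant": "float16",      "max_model_mb": 2048},
    "high":     {"format": "CoreML / TFLite",         "quant": "float16-int8", "max_model_mb": 1024},
    "mid":      {"format": "CoreML GPU / TFLite",     "quant": "int8",         "max_model_mb": 500},
    "low":      {"format": "TFLite int8",             "quant": "int8",         "max_model_mb": 200},
}

def fits_on_tier(tier: str, model_size_mb: int) -> bool:
    """True if a packaged model of the given size fits the tier's size budget."""
    return model_size_mb <= DEVICE_TIERS[tier]["max_model_mb"]
```

For example, a 400 MB int8 package fits a mid-tier device (500 MB budget) but not a low-tier one (200 MB budget).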
## Using HuggingFace models

Any HuggingFace model that's compatible with MLX or llama.cpp works:

```bash
# MLX format (Apple Silicon)
octomil serve mlx-community/Qwen2.5-Coder-7B-Instruct-4bit

# GGUF format (any platform)
octomil serve bartowski/Llama-3.2-3B-Instruct-GGUF --engine llama.cpp
```
## RAM requirements
Rough guide for 4-bit quantized models:
| Model size | RAM needed | Devices |
|---|---|---|
| 360M-1B | 1-2 GB | Any modern device |
| 3B-4B | 3-4 GB | 8GB+ Mac, flagship phones |
| 7B-8B | 5-6 GB | 16GB+ Mac |
| 12B-14B | 8-10 GB | 16GB+ Mac |
| 27B | 16-18 GB | 32GB+ Mac |
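The figures in the table follow from simple arithmetic: 4-bit weights take roughly 0.5 bytes per parameter, plus KV cache and runtime overhead on top. A back-of-the-envelope estimator — the 1.25 overhead factor and 1 GB base are my assumptions, tuned to roughly reproduce the table, not an Octomil formula:

```python
def estimate_ram_gb(params_billion: float) -> float:
    """Rough RAM estimate for a 4-bit quantized model.

    4-bit weights are ~0.5 bytes/param; the 1.25 multiplier and 1 GB base
    are assumed allowances for KV cache and runtime overhead.
    """
    return params_billion * 0.5 * 1.25 + 1.0
```

This lands a 7B model at about 5.4 GB and a 27B model at about 17.9 GB, consistent with the 5-6 GB and 16-18 GB rows above.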