
Supported Models

Octomil ships with a curated catalog of models that work out of the box. Run any model with:

```shell
octomil serve <model-name>
```

You can also use any HuggingFace model by passing the repo ID directly:

```shell
octomil serve mlx-community/gemma-3-4b-it-4bit
```

Model catalog

| Model | Params | MLX (Apple Silicon) | llama.cpp (CPU/CUDA) | Best for |
|---|---|---|---|---|
| smollm-360m | 360M | 4-bit | Q4_K_M | Testing, prototyping, constrained devices |
| gemma-1b | 1B | 4-bit | Q4_K_M | Fast responses, low-memory devices |
| qwen-1.5b | 1.5B | 4-bit | Q4_K_M | Multilingual, code |
| llama-1b | 1B | 4-bit | Q4_K_M | General chat |
| llama-3b | 3B | 4-bit | Q4_K_M | Balanced speed/quality |
| qwen-3b | 3B | 4-bit | Q4_K_M | Multilingual, code, reasoning |
| gemma-4b | 4B | 4-bit | Q4_K_M | Strong general-purpose |
| phi-mini | 3.8B | 4-bit | Q4_K_M | Code, reasoning, compact |
| mistral-7b | 7B | 4-bit | Q4_K_M | High-quality general-purpose |
| qwen-7b | 7B | 4-bit | Q4_K_M | Multilingual, long context |
| llama-8b | 8B | 4-bit | Q4_K_M | Best open-source at this size |
| phi-4 | 14B | 4-bit | - | Strong reasoning, needs 16GB+ RAM |
| gemma-12b | 12B | 4-bit | - | High quality, needs 16GB+ RAM |
| gemma-27b | 27B | 4-bit | - | Near-frontier quality, needs 32GB+ RAM |

Engine selection

Octomil auto-selects the fastest engine for your hardware:

| Engine | Platform | Acceleration | Priority |
|---|---|---|---|
| MLX | Apple Silicon Mac | Metal GPU (unified memory) | 1st |
| MNN-LLM | All platforms | CPU + GPU | 2nd |
| llama.cpp | All platforms | Metal / CUDA / CPU | 3rd |
| ExecuTorch | All platforms | CoreML / QNN / XNNPACK | 4th |

Override with --engine:

```shell
octomil serve phi-mini --engine llama.cpp
```
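The priority order above amounts to a fallback chain: prefer MLX where it exists, otherwise walk down the list. A minimal sketch of that logic (the function names and engine strings here are illustrative assumptions, not Octomil's actual internals):

```python
import platform

# Engine priority from the table above (1st to 4th); names are illustrative.
ENGINE_PRIORITY = ["mlx", "mnn-llm", "llama.cpp", "executorch"]

def available_engines() -> list:
    """Return engines plausibly usable on this machine."""
    engines = ["mnn-llm", "llama.cpp", "executorch"]  # cross-platform
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        engines.append("mlx")  # Apple Silicon only
    return engines

def select_engine(override: str = "") -> str:
    """Pick the highest-priority available engine, or honor an --engine override."""
    if override:
        return override
    avail = set(available_engines())
    return next(e for e in ENGINE_PRIORITY if e in avail)
```

An explicit override always wins, mirroring the `--engine` flag; otherwise the first table entry available on the host is used.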

Device deployment formats

When deploying to mobile devices with octomil deploy --phone, the server automatically selects the optimal format:

| Device tier | Example devices | Format | Quantization | Max model size |
|---|---|---|---|---|
| Flagship | iPhone 15 Pro+, Galaxy S24+, Pixel 9 | CoreML / TFLite float16 | float16 | 2 GB |
| High | iPhone 14 Pro, Galaxy S22, Pixel 7 | CoreML / TFLite | float16-int8 | 1 GB |
| Mid | iPhone 12, Galaxy A54 | CoreML GPU / TFLite | int8 | 500 MB |
| Low | Older Android (under 4GB RAM) | TFLite int8 | int8 | 200 MB |
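Conceptually, the table is a lookup from device tier to a format spec plus a size cap. A sketch of that resolution step (the tier keys and `resolve_format` helper are hypothetical, not Octomil's API):

```python
# Quantization and size cap per device tier, taken from the table above.
TIER_FORMATS = {
    "flagship": {"quant": "float16",      "max_model_mb": 2048},
    "high":     {"quant": "float16-int8", "max_model_mb": 1024},
    "mid":      {"quant": "int8",         "max_model_mb": 500},
    "low":      {"quant": "int8",         "max_model_mb": 200},
}

def resolve_format(tier: str, model_size_mb: int) -> dict:
    """Return the format spec for a tier, rejecting models over the size cap."""
    spec = TIER_FORMATS[tier]
    if model_size_mb > spec["max_model_mb"]:
        raise ValueError(f"model too large for {tier} tier")
    return spec
```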

See Deploy Compatibility for details on device tier classification and format resolution.

Using HuggingFace models

Any HuggingFace model that's compatible with MLX or llama.cpp works:

```shell
# MLX format (Apple Silicon)
octomil serve mlx-community/Qwen2.5-Coder-7B-Instruct-4bit

# GGUF format (any platform)
octomil serve bartowski/Llama-3.2-3B-Instruct-GGUF --engine llama.cpp
```

RAM requirements

Rough guide for 4-bit quantized models:

| Model size | RAM needed | Devices |
|---|---|---|
| 360M-1B | 1-2 GB | Any modern device |
| 3B-4B | 3-4 GB | 8GB+ Mac, flagship phones |
| 7B-8B | 5-6 GB | 16GB+ Mac |
| 12B-14B | 8-10 GB | 16GB+ Mac |
| 27B | 16-18 GB | 32GB+ Mac |
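The figures above are roughly weights plus overhead: 4-bit weights take about 0.6 bytes per parameter (including quantization scales), and the KV cache plus runtime add roughly another gigabyte. A back-of-envelope estimate in that spirit (the constants are my heuristic fit to the table, not Octomil's formula):

```python
def estimate_ram_gb(params_billions: float) -> float:
    """Rough RAM estimate for a 4-bit quantized model.

    ~0.625 bytes/param covers 4-bit weights plus quantization scales;
    ~1 GB covers KV cache and runtime overhead. Heuristic only.
    """
    weights_gb = params_billions * 0.625
    overhead_gb = 1.0
    return round(weights_gb + overhead_gb, 1)
```

For example, an 8B model works out to about 6 GB, in line with the 5-6 GB row above.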