
# Supported Models

Octomil ships with a curated catalog of models that work out of the box. Run any model with:

```shell
octomil serve <model-name>
```

You can also use any HuggingFace model by passing the repo ID directly:

```shell
octomil serve mlx-community/gemma-3-4b-it-4bit
```

## Model catalog

| Model | Params | MLX (Apple Silicon) | llama.cpp (CPU/CUDA) | Best for |
|---|---|---|---|---|
| smollm-360m | 360M | 4-bit | Q4_K_M | Testing, prototyping, constrained devices |
| gemma3-1b | 1B | 4-bit | Q4_K_M | Fast responses, low-memory devices |
| qwen-1.5b | 1.5B | 4-bit | Q4_K_M | Multilingual, code |
| llama-1b | 1B | 4-bit | Q4_K_M | General chat |
| llama-3b | 3B | 4-bit | Q4_K_M | Balanced speed/quality |
| qwen-3b | 3B | 4-bit | Q4_K_M | Multilingual, code, reasoning |
| gemma3-4b | 4B | 4-bit | Q4_K_M | Strong general-purpose |
| phi-mini | 3.8B | 4-bit | Q4_K_M | Code, reasoning, compact |
| mistral-7b | 7B | 4-bit | Q4_K_M | High-quality general-purpose |
| qwen-7b | 7B | 4-bit | Q4_K_M | Multilingual, long context |
| llama-8b | 8B | 4-bit | Q4_K_M | Best open-source at this size |
| phi-4 | 14B | 4-bit | - | Strong reasoning, needs 16GB+ RAM |
| gemma-12b | 12B | 4-bit | - | High quality, needs 16GB+ RAM |
| gemma-27b | 27B | 4-bit | - | Near-frontier quality, needs 32GB+ RAM |

## Engine selection

Octomil auto-selects the fastest engine for your hardware:

| Engine | Platform | Acceleration | Priority |
|---|---|---|---|
| MLX | Apple Silicon Mac | Metal GPU (unified memory) | 1st |
| MNN-LLM | All platforms | CPU + GPU | 2nd |
| llama.cpp | All platforms | Metal / CUDA / CPU | 3rd |
| ExecuTorch | All platforms | CoreML / QNN / XNNPACK | 4th |
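The priority order above amounts to "pick the highest-priority engine your hardware supports." A minimal sketch of that idea, assuming a simple platform check via `uname` (the real detection is internal to Octomil, and this sketch collapses the middle tiers into one fallback):

```shell
# Illustrative sketch of engine priority: MLX first on Apple Silicon,
# otherwise fall back to a portable engine further down the table.
pick_engine() {
  if [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo "mlx"          # 1st choice: Metal GPU with unified memory
  else
    echo "llama.cpp"    # portable fallback (Metal / CUDA / CPU)
  fi
}
pick_engine
```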

Override with `--engine`:

```shell
octomil serve phi-mini --engine llama.cpp
```

## Device deployment formats

When deploying to mobile devices with `octomil deploy --phone`, the server automatically selects the optimal `ArtifactFormat` and `ExecutorDelegate`:

| Device class | Example devices | Artifact format | Quantization | Max model size |
|---|---|---|---|---|
| flagship | iPhone 15 Pro+, Galaxy S24+, Pixel 9 | coreml (ANE) / tflite (NNAPI) | float16 | 2 GB |
| high | iPhone 14, Galaxy S22, Pixel 7 | coreml (ANE) / tflite (GPU) | float16-int8 | 1 GB |
| mid | iPhone 12, Galaxy A54 | coreml (GPU) / tflite (XNNPACK) | int8 | 500 MB |
| low | Older Android (under 4GB RAM) | tflite (XNNPACK) | int8 | 200 MB |

See Deploy Compatibility for details on `DeviceClass` classification and format resolution.
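The "Max model size" column above is effectively a size-to-device-class lookup. A minimal sketch of that lookup, with thresholds taken from the table (the function name is illustrative; the actual resolution happens server-side):

```shell
# Which device classes can hold a model of a given size (in MB),
# per the max-size column above? 1 GB is treated as 1024 MB here.
fits_device_classes() {
  local size_mb=$1
  if   [ "$size_mb" -le 200 ];  then echo "low mid high flagship"
  elif [ "$size_mb" -le 500 ];  then echo "mid high flagship"
  elif [ "$size_mb" -le 1024 ]; then echo "high flagship"
  elif [ "$size_mb" -le 2048 ]; then echo "flagship"
  else echo "none"
  fi
}
fits_device_classes 800   # → high flagship
```

For example, an 800 MB artifact exceeds the mid-tier 500 MB cap, so it only resolves to high and flagship devices.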

## Using HuggingFace models

Any HuggingFace model that's compatible with MLX or llama.cpp works:

```shell
# MLX format (Apple Silicon)
octomil serve mlx-community/Qwen2.5-Coder-7B-Instruct-4bit

# GGUF format (any platform)
octomil serve bartowski/Llama-3.2-3B-Instruct-GGUF --engine llama.cpp
```

## RAM requirements

A rough guide for 4-bit quantized models:

| Model size | RAM needed | Devices |
|---|---|---|
| 360M-1B | 1-2 GB | Any modern device |
| 3B-4B | 3-4 GB | 8GB+ Mac, flagship phones |
| 7B-8B | 5-6 GB | 16GB+ Mac |
| 12B-14B | 8-10 GB | 16GB+ Mac |
| 27B | 16-18 GB | 32GB+ Mac |
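These figures roughly follow from 4-bit weights (about 0.5 bytes per parameter) plus runtime overhead for the KV cache and engine. A back-of-the-envelope sketch, where the 1.5 GB overhead constant is an assumption for illustration rather than an Octomil figure:

```shell
# Rough RAM estimate for a 4-bit quantized model:
#   weights ≈ params_in_billions * 0.5 GB, plus ~1.5 GB assumed overhead
#   (KV cache, runtime). Larger models and long contexts need more.
estimate_ram_gb() {
  awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 0.5 + 1.5 }'
}
estimate_ram_gb 7   # → 5.0, consistent with the 5-6 GB row for 7B-8B
```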