# Supported Models

Octomil ships with a curated catalog of models that work out of the box. Run any model with:

```bash
octomil serve <model-name>
```

You can also use any HuggingFace model by passing the repo ID directly:

```bash
octomil serve mlx-community/gemma-3-4b-it-4bit
```
## Model catalog

| Model | Params | MLX (Apple Silicon) | llama.cpp (CPU/CUDA) | Best for |
|---|---|---|---|---|
| `smollm-360m` | 360M | 4-bit | Q4_K_M | Testing, prototyping, constrained devices |
| `gemma-1b` | 1B | 4-bit | Q4_K_M | Fast responses, low-memory devices |
| `qwen-1.5b` | 1.5B | 4-bit | Q4_K_M | Multilingual, code |
| `llama-1b` | 1B | 4-bit | Q4_K_M | General chat |
| `llama-3b` | 3B | 4-bit | Q4_K_M | Balanced speed/quality |
| `qwen-3b` | 3B | 4-bit | Q4_K_M | Multilingual, code, reasoning |
| `gemma-4b` | 4B | 4-bit | Q4_K_M | Strong general-purpose |
| `phi-mini` | 3.8B | 4-bit | Q4_K_M | Code, reasoning, compact |
| `mistral-7b` | 7B | 4-bit | Q4_K_M | High-quality general-purpose |
| `qwen-7b` | 7B | 4-bit | Q4_K_M | Multilingual, long context |
| `llama-8b` | 8B | 4-bit | Q4_K_M | Best open-source at this size |
| `phi-4` | 14B | 4-bit | - | Strong reasoning, needs 16GB+ RAM |
| `gemma-12b` | 12B | 4-bit | - | High quality, needs 16GB+ RAM |
| `gemma-27b` | 27B | 4-bit | - | Near-frontier quality, needs 32GB+ RAM |
## Engine selection
Octomil auto-selects the fastest engine for your hardware:
| Engine | Platform | Acceleration | Priority |
|---|---|---|---|
| MLX | Apple Silicon Mac | Metal GPU (unified memory) | 1st |
| MNN-LLM | All platforms | CPU + GPU | 2nd |
| llama.cpp | All platforms | Metal / CUDA / CPU | 3rd |
| ExecuTorch | All platforms | CoreML / QNN / XNNPACK | 4th |
Override with `--engine`:

```bash
octomil serve phi-mini --engine llama.cpp
```
## Device deployment formats

When deploying to mobile devices with `octomil deploy --phone`, the server automatically selects the optimal format:
| Device tier | Example devices | Format | Quantization | Max model size |
|---|---|---|---|---|
| Flagship | iPhone 15 Pro+, Galaxy S24+, Pixel 9 | CoreML / TFLite float16 | float16 | 2 GB |
| High | iPhone 14 Pro, Galaxy S22, Pixel 7 | CoreML / TFLite | float16-int8 | 1 GB |
| Mid | iPhone 12, Galaxy A54 | CoreML GPU / TFLite | int8 | 500 MB |
| Low | Older Android (under 4GB RAM) | TFLite int8 | int8 | 200 MB |
See Deploy Compatibility for details on device tier classification and format resolution.
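The tier table above is effectively a lookup from device tier to packaging constraints. A hedged sketch of that mapping — the dictionary layout, field names, and `fits_on_tier` helper are my assumptions for illustration, not Octomil internals:

```python
# Hypothetical encoding of the device-tier table above.
# Sizes are in MB (2 GB = 2048 MB, 1 GB = 1024 MB).
DEVICE_TIERS = {
    "flagship": {"format": "CoreML / TFLite float16", "quant": "float16",      "max_model_mb": 2048},
    "high":     {"format": "CoreML / TFLite",         "quant": "float16-int8", "max_model_mb": 1024},
    "mid":      {"format": "CoreML GPU / TFLite",     "quant": "int8",         "max_model_mb": 500},
    "low":      {"format": "TFLite int8",             "quant": "int8",         "max_model_mb": 200},
}

def fits_on_tier(tier: str, model_size_mb: int) -> bool:
    """True if a packaged model of the given size fits the tier's size budget."""
    return model_size_mb <= DEVICE_TIERS[tier]["max_model_mb"]
```

For example, a 400 MB int8 package fits a mid-tier device (500 MB budget) but not a low-tier one (200 MB budget).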
## Using HuggingFace models

Any HuggingFace model that's compatible with MLX or llama.cpp works:

```bash
# MLX format (Apple Silicon)
octomil serve mlx-community/Qwen2.5-Coder-7B-Instruct-4bit

# GGUF format (any platform)
octomil serve bartowski/Llama-3.2-3B-Instruct-GGUF --engine llama.cpp
```
## RAM requirements
Rough guide for 4-bit quantized models:
| Model size | RAM needed | Devices |
|---|---|---|
| 360M-1B | 1-2 GB | Any modern device |
| 3B-4B | 3-4 GB | 8GB+ Mac, flagship phones |
| 7B-8B | 5-6 GB | 16GB+ Mac |
| 12B-14B | 8-10 GB | 16GB+ Mac |
| 27B | 16-18 GB | 32GB+ Mac |
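The figures in the table follow from simple arithmetic: 4-bit weights take roughly 0.5 bytes per parameter, plus KV cache and runtime overhead on top. A back-of-the-envelope estimator — the 1.25 overhead factor and 1 GB base are my assumptions, tuned to roughly reproduce the table, not an Octomil formula:

```python
def estimate_ram_gb(params_billion: float) -> float:
    """Rough RAM estimate for a 4-bit quantized model.

    4-bit weights are ~0.5 bytes/param; the 1.25 multiplier and 1 GB base
    are assumed allowances for KV cache and runtime overhead.
    """
    return params_billion * 0.5 * 1.25 + 1.0
```

This lands a 7B model at about 5.4 GB and a 27B model at about 17.9 GB, consistent with the 5-6 GB and 16-18 GB rows above.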