Model Optimizer

Octomil automatically picks the best quantization, GPU offload, and runtime settings when you serve or pull a model. No configuration needed.

It Just Works

octomil serve llama3.1:8b
  Hardware optimization
Quantization: Q6_K
Strategy: full_gpu (all GPU layers)
VRAM: 7.2 GB RAM: 0.0 GB
Est. speed: 38.5 tok/s (high confidence)
Serving as: llama3.1:8b:q6_k

Detecting engines...
+ MLX (Apple Silicon)
+ llama.cpp (GGUF)

Starting Octomil serve on 0.0.0.0:8080

Same for pull:

octomil pull llama3.1:8b
# detects hardware, picks Q6_K, pulls llama3.1:8b:q6_k

Override with an explicit quantization to skip auto-optimization:

octomil serve llama3.1:8b:q4_k_m   # explicit, no auto-optimize
octomil serve llama3.1:8b:fp16     # full precision

How It Works

On every serve or pull, Octomil:

  1. Detects hardware — GPU model, VRAM, RAM, CPU features
  2. Picks quantization — tries Q8_0 down to Q2_K, picks the best quality that fits
  3. Chooses offload strategy — full GPU, partial offload, or CPU-only
  4. Appends the optimal variant: llama3.1:8b becomes llama3.1:8b:q6_k

Memory Strategies

Strategy          GPU Layers  When Used
full_gpu          All         Model + KV cache fits entirely in VRAM
partial_offload   Some        Model partly fits in VRAM, rest in RAM
cpu_only          None        No GPU available, or model too large
aggressive_quant  Varies      Forces Q2_K with a quality warning
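The first three rows of the table reduce to a simple decision rule. A hedged sketch (the function name and thresholds are assumptions; `aggressive_quant` is a forced mode and is not modeled here):

```python
def choose_strategy(model_gb, vram_gb):
    """Illustrative offload chooser mirroring the strategy table (not Octomil's code)."""
    if vram_gb <= 0:
        return "cpu_only"          # no GPU available
    if model_gb <= vram_gb:
        return "full_gpu"          # weights + KV cache fit entirely in VRAM
    return "partial_offload"       # split: some layers in VRAM, rest in RAM

print(choose_strategy(7.5, 24))    # -> full_gpu
```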

Quantization Levels

Tried from highest quality to lowest — the best that fits your hardware is selected:

Quantization  Bytes/Param  Quality
Q8_0          1.0          Excellent
Q6_K          0.8125       Very good
Q5_K_M        0.75         Good
Q4_K_M        0.625        Good
Q4_0          0.5          Acceptable
Q3_K_M        0.4375       Reduced
Q2_K          0.3125       Low

Higher-VRAM systems (24 GB+) get Q8_0, which most tools skip.
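The bytes-per-param column makes weight sizes easy to estimate. A quick sketch for an 8B-parameter model (approximation only; real GGUF files carry some extra metadata and mixed-precision tensors):

```python
# Approximate weight size in GB: parameters (billions) * bytes per parameter.
BYTES_PER_PARAM = {"Q8_0": 1.0, "Q6_K": 0.8125, "Q5_K_M": 0.75,
                   "Q4_K_M": 0.625, "Q4_0": 0.5, "Q3_K_M": 0.4375, "Q2_K": 0.3125}

def approx_weight_gb(params_b, quant):
    return params_b * BYTES_PER_PARAM[quant]

for q in BYTES_PER_PARAM:
    print(f"{q:7s} ~{approx_weight_gb(8, q):.1f} GB")  # 8B model at each level
```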

Multi-GPU

VRAM is summed across all GPUs:

# Two RTX 4090s (48 GB total) — 70B @ Q4_K_M fits fully in GPU
octomil serve llama3.1:70b
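The pooling rule above is just a sum over devices. A minimal sketch, assuming a simple sum with no per-device overhead (the helper name is illustrative):

```python
def total_vram_gb(per_gpu_vram_gb):
    """Illustrative: pool VRAM across all detected GPUs (assumed simple sum)."""
    return sum(per_gpu_vram_gb)

# Two RTX 4090s: 24 + 24 = 48 GB pooled.
# 70B at Q4_K_M needs roughly 70 * 0.625 = 43.75 GB of weights, which fits.
print(total_vram_gb([24, 24]))
```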

Apple Silicon

Unified memory is treated as VRAM (minus 4 GB OS reserve):

# M4 Max 128 GB — 124 GB usable — 70B @ Q6_K fits fully
octomil serve llama3.1:70b
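The unified-memory rule stated above, as a sketch (the 4 GB OS reserve is taken from the doc; the function name is illustrative):

```python
def usable_unified_gb(total_gb, os_reserve_gb=4):
    """Illustrative: unified memory treated as VRAM, minus an OS reserve."""
    return total_gb - os_reserve_gb

# M4 Max with 128 GB: 124 GB usable.
# 70B at Q6_K needs roughly 70 * 0.8125 = 56.9 GB of weights, so it fits fully.
print(usable_unified_gb(128))
```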