Model Optimizer

Octomil automatically picks the best quantization, GPU offload, and runtime settings when you serve or pull a model. No configuration needed.

It Just Works

octomil serve llama3.1:8b
  Hardware optimization
Quantization: Q6_K
Strategy: full_gpu (all GPU layers)
VRAM: 7.2 GB RAM: 0.0 GB
Est. speed: 38.5 tok/s (high confidence)
Serving as: llama3.1:8b:q6_k

Detecting engines...
+ MLX (Apple Silicon)
+ llama.cpp (GGUF)

Starting Octomil serve on 0.0.0.0:8080

Same for pull:

octomil pull llama3.1:8b
# detects hardware, picks Q6_K, pulls llama3.1:8b:q6_k

Override with an explicit quantization to skip auto-optimization:

octomil serve llama3.1:8b:q4_k_m   # explicit, no auto-optimize
octomil serve llama3.1:8b:fp16     # full precision

How It Works

On every serve or pull, Octomil:

  1. Detects hardware — GPU model, VRAM, RAM, CPU features
  2. Picks quantization — tries Q8_0 down to Q2_K, picks the best quality that fits
  3. Chooses offload strategy — full GPU, partial offload, or CPU-only
  4. Appends the optimal variant: llama3.1:8b becomes llama3.1:8b:q6_k

Memory Strategies

Strategy          GPU Layers  When Used
full_gpu          All         Model + KV cache fits entirely in VRAM
partial_offload   Some        Model partly fits in VRAM, rest in RAM
cpu_only          None        No GPU available, or model too large
aggressive_quant  Varies      Forces Q2_K with a quality warning
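The first three rows of the table reduce to a simple decision rule. A hedged sketch (the function name and thresholds are assumptions; `aggressive_quant` is a forced mode and is not modeled here):

```python
def choose_strategy(model_gb, vram_gb):
    """Illustrative offload chooser mirroring the strategy table (not Octomil's code)."""
    if vram_gb <= 0:
        return "cpu_only"          # no GPU available
    if model_gb <= vram_gb:
        return "full_gpu"          # weights + KV cache fit entirely in VRAM
    return "partial_offload"       # split: some layers in VRAM, rest in RAM

print(choose_strategy(7.5, 24))    # -> full_gpu
```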

Quantization Levels

Tried from highest quality to lowest — the best that fits your hardware is selected:

Quantization  Bytes/Param  Quality
Q8_0          1.0          Excellent
Q6_K          0.8125       Very good
Q5_K_M        0.75         Good
Q4_K_M        0.625        Good
Q4_0          0.5          Acceptable
Q3_K_M        0.4375       Reduced
Q2_K          0.3125       Low

Higher-VRAM systems (24 GB+) get Q8_0, which most tools skip.
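The bytes-per-param column makes weight sizes easy to estimate. A quick sketch for an 8B-parameter model (approximation only; real GGUF files carry some extra metadata and mixed-precision tensors):

```python
# Approximate weight size in GB: parameters (billions) * bytes per parameter.
BYTES_PER_PARAM = {"Q8_0": 1.0, "Q6_K": 0.8125, "Q5_K_M": 0.75,
                   "Q4_K_M": 0.625, "Q4_0": 0.5, "Q3_K_M": 0.4375, "Q2_K": 0.3125}

def approx_weight_gb(params_b, quant):
    return params_b * BYTES_PER_PARAM[quant]

for q in BYTES_PER_PARAM:
    print(f"{q:7s} ~{approx_weight_gb(8, q):.1f} GB")  # 8B model at each level
```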

Multi-GPU

VRAM is summed across all GPUs:

# Two RTX 4090s (48 GB total) — 70B @ Q4_K_M fits fully in GPU
octomil serve llama3.1:70b
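The pooling rule above is just a sum over devices. A minimal sketch, assuming a simple sum with no per-device overhead (the helper name is illustrative):

```python
def total_vram_gb(per_gpu_vram_gb):
    """Illustrative: pool VRAM across all detected GPUs (assumed simple sum)."""
    return sum(per_gpu_vram_gb)

# Two RTX 4090s: 24 + 24 = 48 GB pooled.
# 70B at Q4_K_M needs roughly 70 * 0.625 = 43.75 GB of weights, which fits.
print(total_vram_gb([24, 24]))
```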

Apple Silicon

Unified memory is treated as VRAM (minus 4 GB OS reserve):

# M4 Max 128 GB — 124 GB usable — 70B @ Q6_K fits fully
octomil serve llama3.1:70b
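The unified-memory rule stated above, as a sketch (the 4 GB OS reserve is taken from the doc; the function name is illustrative):

```python
def usable_unified_gb(total_gb, os_reserve_gb=4):
    """Illustrative: unified memory treated as VRAM, minus an OS reserve."""
    return total_gb - os_reserve_gb

# M4 Max with 128 GB: 124 GB usable.
# 70B at Q6_K needs roughly 70 * 0.8125 = 56.9 GB of weights, so it fits fully.
print(usable_unified_gb(128))
```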