Model Optimizer
Octomil automatically picks the best quantization, GPU offload, and runtime settings when you serve or pull a model. No configuration needed.
It Just Works
octomil serve llama3.1:8b
Hardware optimization
Quantization: Q6_K
Strategy: full_gpu (all GPU layers)
VRAM: 7.2 GB RAM: 0.0 GB
Est. speed: 38.5 tok/s (high confidence)
Serving as: llama3.1:8b:q6_k
Detecting engines...
+ MLX (Apple Silicon)
+ llama.cpp (GGUF)
Starting Octomil serve on 0.0.0.0:8080
Same for pull:
octomil pull llama3.1:8b
# detects hardware, picks Q6_K, pulls llama3.1:8b:q6_k
Override with an explicit quantization to skip auto-optimization:
octomil serve llama3.1:8b:q4_k_m # explicit — no auto-optimize
octomil serve llama3.1:8b:fp16 # full precision
How It Works
On every serve or pull, Octomil:
- Detects hardware — GPU model, VRAM, RAM, CPU features
- Picks quantization — tries Q8_0 down to Q2_K, picks the best quality that fits
- Chooses offload strategy — full GPU, partial offload, or CPU-only
- Appends the optimal variant: llama3.1:8b becomes llama3.1:8b:q6_k
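The final step, together with the explicit-override behavior shown earlier, can be sketched as a small tag-resolution function. This is a minimal illustration rather than Octomil's actual code: `resolve_variant` is a hypothetical name, and the rule that a third colon-separated field means "quantization already chosen" is an assumption inferred from the example tags.

```python
def resolve_variant(tag: str, picked_quant: str) -> str:
    """Return the tag to serve, appending the picked quantization if none is given.

    Assumption: tags look like name:size or name:size:quant, as in the examples.
    """
    parts = tag.split(":")
    if len(parts) >= 3:
        # An explicit quantization was supplied, so auto-optimization is skipped.
        return tag
    # No quantization in the tag: append the auto-selected one in lowercase.
    return f"{tag}:{picked_quant.lower()}"


print(resolve_variant("llama3.1:8b", "Q6_K"))       # llama3.1:8b:q6_k
print(resolve_variant("llama3.1:8b:fp16", "Q6_K"))  # llama3.1:8b:fp16
```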
Memory Strategies
| Strategy | GPU Layers | When Used |
|---|---|---|
| full_gpu | All | Model + KV cache fits entirely in VRAM |
| partial_offload | Some | Model partly fits in VRAM, rest in RAM |
| cpu_only | None | No GPU available, or model too large |
| aggressive_quant | Varies | Forces Q2_K with a quality warning |
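As a rough sketch of how the first three rows might translate into a decision, assuming the only inputs are the model size, the KV cache size, and the available VRAM. `choose_strategy` and its parameters are hypothetical names. The "model too large" case for cpu_only and the trigger for aggressive_quant are left out because their exact conditions are not specified here.

```python
def choose_strategy(model_gb: float, kv_cache_gb: float, vram_gb: float) -> str:
    """Pick a memory strategy from sizes in GB (illustrative thresholds only)."""
    needed_gb = model_gb + kv_cache_gb
    if vram_gb <= 0:
        return "cpu_only"          # no GPU available
    if needed_gb <= vram_gb:
        return "full_gpu"          # model + KV cache fit entirely in VRAM
    return "partial_offload"       # some layers on the GPU, the rest in RAM


# Illustrative inputs: 6.5 GB of weights plus an assumed 0.7 GB KV cache.
print(choose_strategy(6.5, 0.7, 24.0))  # full_gpu
print(choose_strategy(6.5, 0.7, 4.0))   # partial_offload
print(choose_strategy(6.5, 0.7, 0.0))   # cpu_only
```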
Quantization Levels
Tried from highest quality to lowest — the best that fits your hardware is selected:
| Quantization | Bytes/Param | Quality |
|---|---|---|
| Q8_0 | 1.0 | Excellent |
| Q6_K | 0.8125 | Very good |
| Q5_K_M | 0.75 | Good |
| Q4_K_M | 0.625 | Good |
| Q4_0 | 0.5 | Acceptable |
| Q3_K_M | 0.4375 | Reduced |
| Q2_K | 0.3125 | Low |
Systems with 24 GB or more of VRAM can get Q8_0 when the model fits, a level that most tools skip.
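The "best that fits" rule can be approximated with the Bytes/Param column. The sketch below is an assumption-laden simplification: `pick_quantization` and `weights_gb` are hypothetical names, 1 GB is taken as 10^9 bytes, and only the weights are counted. The real optimizer also accounts for the KV cache, so it can land on a lower level than this estimate.

```python
from typing import Optional

# Bytes per parameter, copied from the table above, highest quality first.
QUANT_LEVELS = [
    ("Q8_0", 1.0),
    ("Q6_K", 0.8125),
    ("Q5_K_M", 0.75),
    ("Q4_K_M", 0.625),
    ("Q4_0", 0.5),
    ("Q3_K_M", 0.4375),
    ("Q2_K", 0.3125),
]


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB: billions of params times bytes per param."""
    return params_billions * bytes_per_param


def pick_quantization(params_billions: float, budget_gb: float) -> Optional[str]:
    """Return the first (highest-quality) level whose weights fit the budget."""
    for name, bytes_per_param in QUANT_LEVELS:
        if weights_gb(params_billions, bytes_per_param) <= budget_gb:
            return name
    return None  # nothing fits, not even Q2_K


print(pick_quantization(8, 7.0))    # Q6_K   (8.0 GB at Q8_0 is too large, 6.5 GB fits)
print(pick_quantization(70, 48.0))  # Q4_K_M (43.75 GB fits, 52.5 GB at Q5_K_M does not)
```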
Multi-GPU
VRAM is summed across all GPUs:
# Two RTX 4090s (48 GB total) — 70B @ Q4_K_M fits fully in GPU
octomil serve llama3.1:70b
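The arithmetic behind this example, as a quick check. It counts weights only and uses the 0.625 bytes/param figure for Q4_K_M from the table. The variable names are illustrative.

```python
gpus_vram_gb = [24.0, 24.0]            # two RTX 4090s, 24 GB each
total_vram_gb = sum(gpus_vram_gb)      # VRAM is summed across GPUs

weights_gb = 70 * 0.625                # 70B parameters at Q4_K_M
headroom_gb = total_vram_gb - weights_gb

print(total_vram_gb, weights_gb, headroom_gb)  # 48.0 43.75 4.25
```

The remaining 4.25 GB is what is left for the KV cache and runtime overhead in this estimate.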
Apple Silicon
Unified memory is treated as VRAM (minus 4 GB OS reserve):
# M4 Max 128 GB — 124 GB usable — 70B @ Q6_K fits fully
octomil serve llama3.1:70b
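The same kind of check for unified memory, assuming the 4 GB reserve is simply subtracted from the total. `usable_unified_gb` is a hypothetical helper, and the clamp at zero is an added safeguard, not documented behavior.

```python
OS_RESERVE_GB = 4.0  # OS reserve stated above


def usable_unified_gb(total_gb: float) -> float:
    """Unified memory treated as VRAM, minus the OS reserve (never negative)."""
    return max(total_gb - OS_RESERVE_GB, 0.0)


usable_gb = usable_unified_gb(128.0)   # M4 Max with 128 GB
q6k_weights_gb = 70 * 0.8125           # 70B parameters at Q6_K

print(usable_gb, q6k_weights_gb, q6k_weights_gb <= usable_gb)  # 124.0 56.875 True
```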