Storing a model's weights at lower numeric precision (e.g. 4-bit instead of 16-bit) to shrink its memory footprint and speed it up, at the cost of some accuracy.
Models are trained with high-precision numbers (typically 16-bit floats). Quantization compresses those weights to fewer bits (commonly 8-bit, 5-bit, or 4-bit) so the model takes far less memory and usually runs faster, since less data has to be moved for every token. A 4-bit quant is roughly a quarter the size of the 16-bit original: about 3.5 GB instead of 14 GB for a 7-billion-parameter model. That is often the difference between a model fitting on your GPU or not.
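The size arithmetic can be sketched in a few lines. This is a rough estimate of the weights alone; the function name `weight_memory_gb` is made up for illustration, and it assumes exactly N bits per weight. Real quant formats also store scaling factors, so their effective bits per weight run somewhat higher, and running a model needs extra memory beyond the weights.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumption: every weight takes exactly `bits_per_weight` bits.
    Real formats add a small overhead for scales and metadata.
    """
    return n_params * bits_per_weight / 8 / 1e9


# A 7-billion-parameter model at the bit widths mentioned above.
for bits in (16, 8, 5, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Because the estimate is linear in the bit width, halving the bits halves the size, which is where the "quarter the size" figure for 4-bit versus 16-bit comes from.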
Quantization trades a little accuracy for large savings. The quality loss at 8-bit is usually negligible; 4-bit is the popular "sweet spot", with modest loss; below 4-bit (3-bit, 2-bit) the degradation becomes noticeable, especially on smaller models, which tolerate aggressive quantization less well than larger ones.
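To see why fewer bits cost more accuracy, here is a minimal sketch of the simplest possible scheme: symmetric uniform quantization with one scale for the whole list of weights. It is an illustrative assumption, not how any particular quant format works (practical formats typically use per-block scales and other refinements), and the random weights are synthetic. It measures how far the restored weights drift from the originals, which is only a proxy for model quality.

```python
import random


def quantize_roundtrip(weights, bits):
    """Round weights to signed integers of the given width, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax     # one shared scale (simplifying assumption)
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average distance between original and restored weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
# Synthetic stand-in for a layer's weights: small values centred on zero.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

Each bit removed roughly doubles the spacing between representable values, so the rounding error grows quickly at the low end of the range.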
Quantization, as used here, is applied after training and only changes how the weights are stored for inference; it does not retrain the model. Spanvero's VRAM-to-run estimates assume a default quant level, which is why a model can be listed as runnable on hardware that couldn't hold its full-precision version.
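The effect of the assumed quant level on a "can I run it?" answer can be illustrated with a hypothetical check. This is not Spanvero's actual formula: the function `fits_in_vram` and the flat `overhead_gb` allowance for everything besides the weights are assumptions made up for this sketch.

```python
def fits_in_vram(n_params, bits_per_weight, vram_gb, overhead_gb=1.5):
    """Rough check: do the weights plus a flat overhead fit in the given VRAM?

    `overhead_gb` is a made-up placeholder for context cache, activations,
    and runtime buffers; real overhead depends on the model and settings.
    """
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb <= vram_gb


# A 13-billion-parameter model on a 12 GB GPU.
print(fits_in_vram(13e9, 16, vram_gb=12))  # 26 GB of weights alone
print(fits_in_vram(13e9, 4, vram_gb=12))   # 6.5 GB of weights plus overhead
```

The same model fails the check at 16-bit and passes at 4-bit, which is the situation the estimates describe.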
Q4_K_M and quant levels · GGUF · VRAM · Parameters (the "B" / billions)
© 2026 Cynosure LLC.