
Quantization

Storing a model's weights at lower numeric precision (e.g. 4-bit instead of 16-bit) to shrink its memory footprint and speed it up, at the cost of a little accuracy — the single most important trick for running big models on ordinary hardware.

Models are trained using high-precision numbers, typically 16-bit floating point (and sometimes 32-bit). Every one of a model's billions of parameters is one of those numbers. Quantization is the process of re-encoding those weights with fewer bits (commonly 8, 6, 5, or 4 bits each) so the model takes far less memory and moves through the hardware faster. A 4-bit quant is roughly a quarter the size of the 16-bit original. In practical terms, that is often the difference between a model fitting on your GPU and not fitting at all.

Here is why the memory saving is so large and so useful. A 70B model at 16-bit precision stores 2 bytes per parameter, so it needs on the order of 140 GB just for its weights, far beyond any single consumer card. Quantized to 4-bit it drops to roughly 40 GB (a little above the 35 GB that exactly 4 bits per weight would give, because practical 4-bit formats also store scaling data alongside the weights), which brings it into reach of a couple of high-end GPUs or a well-specced workstation. A 7-8B model at 4-bit needs only about 4-5 GB of weights, so it runs comfortably on an entry-level card. That is the leverage quantization gives you.
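The arithmetic behind those figures is simple enough to sketch. The function name below is hypothetical, and the 4.5 bits-per-weight value is an illustrative assumption standing in for the overhead of real 4-bit formats, not a measured number for any specific quant.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9

# 4.5 bits per weight is an assumed, illustrative figure for a practical
# 4-bit format that also stores per-block scaling data.
examples = [
    ("70B at 16-bit", 70, 16),
    ("70B at exactly 4-bit", 70, 4),
    ("70B at ~4.5 bits (assumed)", 70, 4.5),
    ("8B at ~4.5 bits (assumed)", 8, 4.5),
]
for label, params, bits in examples:
    print(f"{label}: {weight_memory_gb(params, bits):.1f} GB")
```

This covers weights only; the context and KV cache need memory on top of it.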

The trade is accuracy for savings, and the trade is usually very favorable. Going from 16-bit to 8-bit typically causes negligible quality loss — most people can't tell the difference. 4-bit is the popular "sweet spot": you get roughly a 4x size reduction for a small, often barely noticeable, drop in quality. Below 4-bit — at 3-bit and especially 2-bit — the degradation becomes real and visible, showing up as more mistakes, worse reasoning, and less coherent long outputs. As a general pattern, larger models tolerate aggressive quantization better than small ones, because they have more redundancy to spare; squeezing a 3B model down to 2-bit hurts a lot more than doing the same to a 70B model.
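To get a feel for why the damage grows quickly below 4-bit, here is a toy sketch that rounds randomly generated stand-in weights to a few bit-widths and measures the average rounding error. It assumes the simplest possible scheme (symmetric round-to-nearest with a single scale), which is not what real quant formats do, and weight error is only a rough proxy for output quality.

```python
import random

def quantize_roundtrip(values, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole list."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 positive steps for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

random.seed(0)
# Stand-in "weights": small Gaussian values, an assumption for illustration only.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

Each bit removed halves the number of available levels, so the rounding error rises steeply as the bit-width shrinks.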

A few important clarifications. The quantization most local users encounter happens after training and only changes how the weights are stored for inference; it does not retrain the model or change what it fundamentally learned. Modern quantization methods are also smarter than a naive "round every number to 4 bits." The K-quant methods used in the GGUF ecosystem work in blocks, each with its own scale, and keep the most sensitive parts of the model at higher precision, which is why a well-made 4-bit quant holds up better than the bit count alone would suggest. If you want the details of the naming, see the explainer on Q4_K_M and quant levels.
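The benefit of working in blocks can be shown with a small sketch. It compares one scale for a whole set of stand-in weights against one scale per block of 32, with a single large outlier planted in the data. This is a simplified illustration under assumed values, not the actual GGUF K-quant format, and the function names are hypothetical.

```python
import random

def roundtrip(values, bits, block_size):
    """Round-to-nearest quantization with one scale per block of block_size values."""
    levels = 2 ** (bits - 1) - 1
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / levels if max_abs > 0 else 1.0
        restored.extend(round(v / scale) * scale for v in block)
    return restored

def mean_abs_error(values, restored):
    """Average absolute difference between original and round-tripped values."""
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
weights[100] = 1.0  # a single large outlier stretches any scale that covers it

one_scale = mean_abs_error(weights, roundtrip(weights, 4, len(weights)))
per_block = mean_abs_error(weights, roundtrip(weights, 4, 32))
print(f"one scale for everything: {one_scale:.5f}")
print(f"one scale per 32 weights: {per_block:.5f}")
```

With a single scale, the outlier forces coarse steps on every weight; with per-block scales, only the block containing it pays that price.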

It's worth distinguishing the two broad families of quantization you'll encounter. Most local users deal with post-training quantization: a fully-trained model is simply compressed afterward into GGUF quants, with no extra training required. This is the fast, easy path that tools like Ollama and LM Studio use. The more advanced family involves further training: quantization-aware training, or fine-tuning on top of quantized weights (QLoRA is the best-known example of the latter). In both cases the model is adapted with the low precision in mind, which can recover some of the lost accuracy. For the vast majority of people just running models, post-training GGUF quants are all you need to think about.

One honest caveat: not every model quantizes equally well, and not every task is equally sensitive to it. Tasks that need precise reasoning, exact code, or careful math tend to show quantization damage sooner than casual chat does, and very small models suffer more than large ones at the same bit-width. So the right quant is workload-dependent: if you notice a quantized model making more mistakes on demanding work, stepping up one quant level (say Q4 to Q5 or Q6) often fixes it at a modest memory cost. The general advice is to run the highest-precision quant your VRAM comfortably allows, leaving headroom for the context and KV cache.
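That rule of thumb can be written down as a small selection routine. Everything here is a sketch under stated assumptions: the bits-per-weight table holds rough illustrative values rather than measured ones, the function name is hypothetical, and the fixed 2 GB headroom is a placeholder for context and KV cache that in practice varies with model and context length.

```python
# Rough, illustrative bits per weight for each quant level (assumed values).
QUANT_BITS = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "Q8": 8.5}

def pick_quant(params_billion, vram_gb, headroom_gb=2.0):
    """Return the highest-precision quant whose weights fit in VRAM minus headroom.

    Returns None when even the smallest listed quant does not fit.
    """
    budget_gb = vram_gb - headroom_gb
    best = None
    # Walk from lowest to highest precision, keeping the last one that fits.
    for name, bits in sorted(QUANT_BITS.items(), key=lambda item: item[1]):
        weights_gb = params_billion * bits / 8
        if weights_gb <= budget_gb:
            best = name
    return best

print(pick_quant(8, 8))     # 8B model on an 8 GB card
print(pick_quant(8, 16))    # 8B model on a 16 GB card
print(pick_quant(70, 24))   # 70B model on a 24 GB card
```

A result of None means the model does not fit at any listed quant on that card, which is the point where a smaller model or a rented GPU becomes the practical choice.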

Quantization is the reason "run it free on your own hardware" is a real option for so many models. It is also why the same model can show a much smaller memory requirement here than its full-precision size would imply. Spanvero's VRAM-to-run estimates assume a sensible default quant for each model, so the numbers you see reflect how people actually run these models — not the theoretical full-precision footprint. To see how the quant level changes the memory and cost for a specific model, use the calculator at /calculator/, and to filter straight to models that fit your card at their default quant, browse /models/8gb-vram/, /models/16gb-vram/, or /models/24gb-vram/.

Turn quantization into a real hardware decision

The useful question is not “is Q4 good?” but “which quant fits my card with room for context?”

Use the VRAM calculator: compare Q4, Q5 and higher precision on the same model.
Find models for your GPU: start from the hardware you own and see what fits.
Compare local vs rented: see when an hourly GPU is cheaper than buying more hardware.

Q4_K_M and quant levels · GGUF · VRAM · Parameters (the "B" / billions) · Inference · Local vs API vs renting a GPU


