What quantization should I use?

Use the highest quant your VRAM comfortably fits with headroom for context — for most people that's Q4_K_M, stepping up to Q5_K_M or Q6_K if you have room, and only dropping to Q3 or below to make a model fit at all.

Once you start downloading GGUF models you're immediately asked to pick a quant level from a list like Q4_K_M, Q5_K_M, Q6_K, Q8_0, Q3_K_M, and so on. The good news is there's a simple, reliable rule for choosing, and you don't need to understand every label to get it right.

First, the one-line answer: pick the highest quant that comfortably fits in your VRAM with headroom left for the context window and KV cache. Quantization stores a model's weights at lower numeric precision to shrink its memory footprint — fewer bits means a smaller file and less VRAM, at some cost to accuracy. So the whole game is trading precision for fit, and the winning move is to give up as little precision as your hardware allows.

Here's how the levels rank in practice. Q8_0 is near-lossless — the safest choice if memory is no object, but large (about half the full-precision size). Q6_K is very close to lossless with a nice size saving. Q5_K_M and Q4_K_M are the recommended general-purpose picks, and Q4_K_M is the usual default you'll see linked first because it hits the best overall balance of size, speed, and quality for most people on most hardware. Q3 variants start to show noticeable quality loss and are for squeezing a model onto tighter hardware. Q2 is a last resort when nothing else fits, with real degradation you should expect to see. The simple mental model: higher number means bigger and more accurate; lower number means smaller and faster but rougher.

The letters matter a little too. The "K" means it uses the modern K-quant method, which keeps the most important weights at higher precision instead of flatly rounding everything — that's why a good Q4_K_M holds up better than the bit count alone suggests. The trailing S/M/L means small/medium/large within the same bit-width; the larger variant keeps a bit more precision and is a bit bigger. When in doubt, the M (medium) variant is the sensible middle. You may also see newer "IQ" quants (like IQ4_XS) that squeeze quality into even smaller sizes, useful at the low end.

Model size changes the calculus in an important way: bigger models tolerate aggressive quantization far better than small ones, because they have more redundancy to spare. On a 70B you can lean into Q4 or even Q3 confidently. On a 7B, stay at Q4 or above if you possibly can, since low-bit quantization bites harder on small models. So "which quant" isn't one answer — it shifts with the model you're running.

Task sensitivity is the other honest wrinkle. Casual chat tolerates lower quants well; tasks that need precise reasoning, exact code, or careful math show quantization damage sooner. If you notice a quantized model making more mistakes on demanding work, stepping up one level (say Q4 to Q5 or Q6) often fixes it at a modest memory cost. So the practical loop is: start at Q4_K_M, go higher if you have VRAM to spare and want more quality, and go lower only to make something fit or to free memory for a longer context.

A quick way to estimate the memory a quant needs: multiply the parameter count by the bits per weight and divide by eight. A 7B at 4-bit is roughly 7 × 4 / 8 ≈ 3.5 GB of weights (a bit more in practice — around 4-5 GB — since K-quants keep some tensors higher). That lets you sanity-check whether a given quant will fit before you download several gigabytes. One counterintuitive note: a lower quant isn't reliably faster in wall-clock terms — generation is often limited by memory bandwidth, not file size — so choose a lower quant to fit or to gain context, not primarily for speed.

Spanvero uses a sensible default quant per model (typically the Q4_K_M-class option) when it computes VRAM-to-run and run cost, so its memory figures reflect a realistic download rather than the full-precision weights. To see exactly how changing the quant level changes the memory and cost for a specific model, open /calculator/, and to filter straight to models that fit your card at their default quant, browse /models/8gb-vram/, /models/16gb-vram/, or /models/24gb-vram/.

Related

Q4_K_M and quant levels · Quantization · GGUF vs safetensors — which should I download? · VRAM · What LLMs can I run on 8GB of VRAM? · How do I run my first local AI model? · Parameters (the "B" / billions) · KV cache

All explainers → · Browse models →

Open the free Spanvero advisor → · Honest, $0-markup. © 2026 Cynosure LLC.

The weekly price index

A short email of real AI price moves, straight from the daily log — no hype. We're collecting the list now; the first issue goes out when it opens. Unsubscribe with one click.