
Q4_K_M and quant levels

A naming scheme for llama.cpp/GGUF quantized models — Q4_K_M means 4-bit, "K-quant" method, Medium size — and it is the most commonly recommended balance of size and quality for local use.

Once you start downloading models to run locally, you immediately hit a wall of cryptic labels: Q4_K_M, Q5_K_S, Q8_0, Q6_K, Q3_K_L, IQ4_XS. These are quantization levels in the GGUF format, and decoding them is one of the most useful five-minute skills for anyone running open models. The good news is the scheme is simple once you know the pattern.

Read the label left to right. The "Q" stands for quantized. The number right after it is the nominal number of bits per weight: Q4 is 4-bit, Q5 is 5-bit, Q8 is 8-bit. Fewer bits means a smaller file and less memory, at some cost to accuracy. The letters after the number describe the method and the size variant. "K" means it uses the K-quant (super-block) method, which is the modern default; labels without it, such as Q8_0, use the older, simpler block formats. The trailing S, M, or L means small, medium, or large: within the same bit-width, the larger variant keeps more precision on the important tensors and so is a bit bigger and a bit more accurate. So Q4_K_M reads as: 4-bit, K-quant method, Medium variant.
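The left-to-right reading can be sketched as a small parser. This is a minimal illustration, not code from llama.cpp: the names QuantLabel and parse_quant_label are hypothetical, and the pattern only covers the label shapes mentioned in this article.

```python
import re
from typing import NamedTuple, Optional


class QuantLabel(NamedTuple):
    family: str             # "Q" or "IQ"
    bits: int               # nominal bits per weight
    method: str             # "K" for K-quant, a digit such as "0" for older formats, "" if absent
    variant: Optional[str]  # "XS", "S", "M", "L", or None when the label has no size suffix


# Hypothetical pattern: prefix, bit count, optional method, optional size variant.
_PATTERN = re.compile(r"^(IQ|Q)(\d+)(?:_(K|\d))?(?:_(XS|S|M|L))?$")


def parse_quant_label(label: str) -> QuantLabel:
    """Split a label such as 'Q4_K_M' into its parts."""
    match = _PATTERN.match(label.upper())
    if match is None:
        raise ValueError(f"unrecognized quant label: {label!r}")
    family, bits, method, variant = match.groups()
    return QuantLabel(family, int(bits), method or "", variant)


for name in ("Q4_K_M", "Q5_K_S", "Q8_0", "Q6_K", "Q3_K_L", "IQ4_XS"):
    print(name, parse_quant_label(name))
```

Labels outside these shapes raise ValueError rather than being guessed at, which is usually the safer behavior when a download page introduces a format you have not seen before.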

The reason K-quants matter is that they do not store every tensor at the same precision. They quantize weights in fixed-size blocks and keep selected high-impact tensors, such as some of the attention and feed-forward projection tensors that most affect output quality, at a higher bit-width, while compressing the rest more aggressively. This is why Q4_K_M consistently holds up better than a naive, flat 4-bit quantization would: the bits are spent where they matter most. You'll also encounter "IQ" quants (like IQ4_XS or IQ2_M), a newer family, usually built with an importance matrix, that can squeeze quality into even smaller sizes, especially at the low end.
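To see why working in blocks helps at all, here is a deliberately simplified sketch of block-wise 4-bit quantization with one scale and one minimum per block. It is not the actual K-quant layout (real K-quants also group blocks into super-blocks and mix bit-widths across tensors); the function names, the block size of 32, and the toy weights are assumptions chosen for illustration.

```python
from typing import List, Tuple

Block = Tuple[float, float, List[int]]  # (scale, minimum, integer codes)


def quantize_block(block: List[float], bits: int = 4) -> Block:
    """Map one block of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in block]
    return scale, lo, codes


def dequantize_block(block: Block) -> List[float]:
    """Reconstruct approximate floats from a quantized block."""
    scale, lo, codes = block
    return [lo + scale * c for c in codes]


def roundtrip(weights: List[float], block_size: int, bits: int = 4) -> List[float]:
    """Quantize and dequantize block by block, returning the restored weights."""
    restored: List[float] = []
    for start in range(0, len(weights), block_size):
        chunk = weights[start:start + block_size]
        restored.extend(dequantize_block(quantize_block(chunk, bits)))
    return restored


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


# Toy weights: 63 small values plus a single large outlier in the second block.
weights = [0.01 * i for i in range(32)] + [0.01 * i for i in range(31)] + [100.0]

flat = roundtrip(weights, block_size=len(weights))  # one scale for everything
blocked = roundtrip(weights, block_size=32)         # one scale per 32 weights

print("flat error on first 32 weights:   ", max_abs_error(weights[:32], flat[:32]))
print("blocked error on first 32 weights:", max_abs_error(weights[:32], blocked[:32]))
```

With a single scale, the outlier stretches the range so far that all the small weights collapse onto the same code. With per-block scales, the outlier only degrades its own block and the first block keeps fine resolution.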

Here is the practical guidance most of the community converges on. Q8_0 is near-lossless and the safest choice if memory is no object, but it is large: about half the size of the 16-bit weights. Q6_K is very close to lossless with a nice size saving. Q5_K_M and Q4_K_M are the recommended general-purpose picks, and Q4_K_M is the usual default you'll see linked first, since it hits the best overall balance of size, speed, and quality for most people on most hardware. Q3 variants start to show noticeable quality loss and are for squeezing a model onto tighter hardware; Q2 is a last resort when nothing else fits, and you should expect real degradation, especially on smaller models. The simple mental model: higher number = bigger file and more accurate; lower number = smaller file but rougher output.

Which one should you actually download? Pick the highest quant that comfortably fits your VRAM with headroom left over for the context window and KV cache. If a model in Q4_K_M leaves you room to spare, try Q5_K_M or Q6_K for a bit more quality. If it barely fits, or you want a longer context, step down to a smaller variant. Bigger models are more forgiving of low bit-widths, so on a 70B you can lean into Q4 or even Q3 confidently, whereas on a 7B you'll want to stay at Q4 or above if you can.
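The "highest quant that fits with headroom" rule can be sketched as a tiny helper. The function name pick_quant, the 2 GB default headroom, and the example file sizes are all assumptions for illustration; in practice you would read the sizes off the model's download page and choose headroom based on the context length you want.

```python
from typing import Dict, Optional


def pick_quant(
    file_sizes_gb: Dict[str, float],
    vram_gb: float,
    headroom_gb: float = 2.0,  # assumed reserve for context window and KV cache
) -> Optional[str]:
    """Return the largest listed quant whose file fits in VRAM minus headroom.

    Assumes that, for a single model, a larger file means a higher-quality quant.
    Returns None when nothing fits.
    """
    budget = vram_gb - headroom_gb
    fitting = {label: size for label, size in file_sizes_gb.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Illustrative sizes only, loosely based on the rough 7B figures in this article.
example_sizes = {"Q4_K_M": 4.3, "Q8_0": 7.0, "F16": 14.0}

print(pick_quant(example_sizes, vram_gb=8))   # Q4_K_M
print(pick_quant(example_sizes, vram_gb=12))  # Q8_0
print(pick_quant(example_sizes, vram_gb=4))   # None
```

Raising headroom_gb is the code equivalent of "step down a variant if you want a longer context".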

A quick way to estimate the file size, which closely tracks the VRAM the weights need: multiply the parameter count by the bits per weight and divide by eight to get bytes. A 7B model at 4-bit is roughly 7 × 4 / 8 ≈ 3.5 GB of weights (a bit more in practice, since K-quants keep some tensors at higher precision, landing around 4-4.5 GB); the same model at 8-bit is about 7 GB, and at full 16-bit about 14 GB. This is why the quant column on a download page usually shows the file size climbing as the quant number rises — you're literally trading disk and memory for precision.
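The rule of thumb above is simple enough to write down directly. This sketch uses the nominal bit-width and decimal gigabytes (1 GB = 10^9 bytes); the function name is hypothetical, and real K-quant files come out somewhat larger, as noted above.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Parameters (in billions) x bits per weight / 8 = size of the weights in GB."""
    return params_billions * bits_per_weight / 8


for bits in (4, 8, 16):
    print(f"7B at {bits}-bit: ~{estimate_weights_gb(7, bits):.1f} GB")
# 7B at 4-bit: ~3.5 GB
# 7B at 8-bit: ~7.0 GB
# 7B at 16-bit: ~14.0 GB
```

Treat the result as a lower bound for the weights alone; the context window and KV cache need memory on top of it.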

One subtlety that trips people up: a lower quant is not always faster in wall-clock terms, even though it's smaller. On a GPU with plenty of VRAM, a higher quant can run at similar speed, because throughput depends on memory bandwidth, compute, and how well a given format's kernels are optimized rather than on file size alone, and some low-bit formats add a small dequantization overhead. The main reason to go lower is to make a model fit, or to free VRAM for a longer context, not primarily for speed. So the honest heuristic remains: fit first, then quality, then worry about speed.

These quant labels are a llama.cpp / GGUF convention, so you'll see them anywhere GGUF is used — Ollama, LM Studio, and llama.cpp itself. Spanvero uses a sensible default quant per model (typically the Q4_K_M-class option) when computing VRAM-to-run and run cost, which is why the memory figures reflect a realistic download rather than the full-precision weights. To see how choosing a different quant changes the memory needed for a given model, use the calculator at /calculator/.

Pick a quant your hardware can actually hold

Q4_K_M is a strong default, but the right answer is the highest useful quant that leaves context headroom.


