Q4_K_M is a label from the llama.cpp/GGUF naming scheme for quantized models: 4-bit, "K-quant" method, Medium variant. It is the most commonly recommended balance of size and quality.
In the GGUF ecosystem, quant labels read as: "Q" + the number of bits + a method/size suffix. So Q4_K_M is 4-bit, the "K-quant" (super-block) method, with an "M" (medium) variant; the suffixes S/M/L mean small/medium/large, where larger keeps more precision on important tensors. You'll also see Q2_K, Q3_K_S/M/L, Q5_K_M, Q6_K, and Q8_0.
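To make the reading rule concrete, here is a minimal sketch that splits a label into its parts. The names `QuantLabel` and `parse_quant_label` are hypothetical helpers invented for this illustration, not part of llama.cpp, and the pattern only covers the labels mentioned above (it assumes the `Q<bits>_K[_S|M|L]` and `Q<bits>_<digit>` shapes and nothing else).

```python
import re
from typing import NamedTuple, Optional


class QuantLabel(NamedTuple):
    bits: int              # the number after "Q"
    method: str            # "K" for K-quants, or a digit such as "0" (as in Q8_0)
    size: Optional[str]    # "small" / "medium" / "large", or None when the label has no S/M/L


_SIZES = {"S": "small", "M": "medium", "L": "large"}

# Assumed label shape: Q<bits>_<K or digit>[_<S|M|L>]. Other GGUF labels are not handled.
_PATTERN = re.compile(r"^Q(\d+)_(K|\d)(?:_([SML]))?$")


def parse_quant_label(label: str) -> QuantLabel:
    """Split a label such as 'Q4_K_M' into bits, method, and size variant."""
    match = _PATTERN.match(label.upper())
    if match is None:
        raise ValueError(f"unrecognized quant label: {label!r}")
    bits, method, size = match.groups()
    return QuantLabel(int(bits), method, _SIZES.get(size))


for name in ["Q4_K_M", "Q3_K_S", "Q6_K", "Q8_0"]:
    print(name, parse_quant_label(name))
```

Labels outside this pattern raise a `ValueError` rather than being guessed at.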
K-quants don't store every weight at the same precision. Weights are grouped into fixed-size super-blocks, each split into smaller blocks that carry their own scale, and the S/M/L mixes keep selected high-impact tensors (such as some of the attention and feed-forward projection tensors) in a higher-bit type. That is why Q4_K_M holds up better than a naive flat 4-bit scheme.
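The benefit of working in blocks can be seen in a toy experiment. The sketch below is a deliberately simplified illustration, not the actual llama.cpp K-quant layout: it assumes plain min/max rounding to 16 levels, a block size of 32, and a made-up list of weights in which one block is much larger than the rest. All function names are hypothetical.

```python
import random


def quantize_dequantize(values, bits=4):
    """Round values to 2**bits evenly spaced levels between their min and max, then map back."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def blockwise(values, block_size=32, bits=4):
    """Apply the same rounding independently to each block, so each block gets its own range."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Toy "tensor": 31 blocks of small weights followed by one block of much larger ones.
weights = [rng.gauss(0.0, 0.02) for _ in range(992)] + [rng.gauss(0.0, 1.0) for _ in range(32)]

flat_err = mean_abs_error(weights, quantize_dequantize(weights))
block_err = mean_abs_error(weights, blockwise(weights))
print(f"flat 4-bit mean error:  {flat_err:.5f}")
print(f"block 4-bit mean error: {block_err:.5f}")
```

With a single range for the whole list, the few large weights stretch the 16 levels so far apart that the small weights lose most of their detail. Giving each block its own range confines that damage to the block that contains the large values.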
The practical guidance most people follow: Q8_0 is near-lossless but big; Q5_K_M and Q4_K_M are the recommended general-purpose picks (Q4_K_M is the usual default); Q3 and Q2 are for squeezing onto tight hardware and lose noticeable quality. A higher number means a bigger, more accurate file; a lower number means a smaller and usually faster one.
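For a rough sense of "bigger" and "smaller", the bit count in the label gives a lower bound on weight storage: parameters times bits, divided by 8 to get bytes. The sketch below assumes a hypothetical 7-billion-parameter model and uses only the nominal bit count; real GGUF files are somewhat larger because blocks also store scales and some tensors are kept at higher precision. The helper name `min_weight_bytes` is made up for this example.

```python
def min_weight_bytes(n_params: int, bits: int) -> float:
    """Lower bound on weight storage in bytes: one `bits`-wide value per parameter."""
    return n_params * bits / 8


N_PARAMS = 7_000_000_000  # hypothetical model size, chosen only for illustration

for label, bits in [("Q8", 8), ("Q5", 5), ("Q4", 4), ("Q3", 3), ("Q2", 2)]:
    gb = min_weight_bytes(N_PARAMS, bits) / 1e9
    print(f"{label}: at least {gb:.2f} GB of weights")
```

For instance, 7 billion parameters at 4 bits each come to at least 3.5 GB, half the 7 GB floor for 8 bits. Treat these as floors for comparing quant levels, not as predictions of actual file size or VRAM use.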
Quantization · GGUF · VRAM · llama.cpp
© 2026 Cynosure LLC.