KV cache

A per-request memory buffer that stores the attention keys and values for tokens already processed, so the model doesn't recompute them on every new token — it speeds up generation but consumes VRAM that grows with context length.

As a model generates text one token at a time, it would otherwise have to re-process the entire sequence for each new token. The KV cache stores the attention "key" and "value" vectors for tokens already seen, so each step computes keys and values only for the newest token and attends over the cached ones. This is what makes generation reasonably fast.
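To make the saving concrete, here is a minimal sketch of a single attention head in plain Python, assuming toy random weights and tiny vectors. The names `KVCache`, `last_output_no_cache`, `matvec`, and `attend` are illustrative, not taken from any real library. It checks that the cached path gives the same output as recomputing every key and value from scratch.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in rows, d_out columns)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def last_output_no_cache(tokens, Wq, Wk, Wv):
    """Without a cache: recompute keys and values for every token so far."""
    keys = [matvec(t, Wk) for t in tokens]
    values = [matvec(t, Wv) for t in tokens]
    return attend(matvec(tokens[-1], Wq), keys, values)

class KVCache:
    """Toy per-request cache: one key and one value vector per token seen."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token, Wq, Wk, Wv):
        # Only the newest token is projected; older entries are reused.
        self.keys.append(matvec(token, Wk))
        self.values.append(matvec(token, Wv))
        return attend(matvec(token, Wq), self.keys, self.values)

random.seed(0)
d = 4  # toy embedding size (assumption)

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
outputs_match = True
for t in range(len(tokens)):
    cached = cache.step(tokens[t], Wq, Wk, Wv)
    full = last_output_no_cache(tokens[: t + 1], Wq, Wk, Wv)
    if not all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(cached, full)):
        outputs_match = False
```

The uncached path does two projections per token in the sequence at every step, while the cached path does two projections per step in total. The price is that `cache.keys` and `cache.values` keep growing by one entry per token.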

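The cost is memory. The KV cache grows linearly with the number of tokens in context and with the number of requests you run in parallel, and it lives in VRAM alongside the weights. Its per-token size also depends on the model: the number of layers, the number of key/value heads, the head dimension, and the numeric precision. With long contexts or many concurrent users, the KV cache can use as much memory as the model itself, or more, which is why long-context serving is memory-hungry.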
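A rough size estimate follows from counting what is stored: one key vector and one value vector per token, per layer, per key/value head. The sketch below assumes a hypothetical model shape (32 layers, 32 key/value heads, head dimension 128, 16-bit values); these numbers are illustrative and not taken from any specific model, and real engines add some overhead on top.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   num_requests=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens * num_requests

GIB = 2 ** 30

# Hypothetical model shape, chosen only for round numbers.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(num_tokens=1, **shape)
one_request = kv_cache_bytes(num_tokens=4096, **shape)
eight_requests = kv_cache_bytes(num_tokens=4096, num_requests=8, **shape)

print(per_token)                  # 524288 bytes, i.e. 0.5 MiB per token
print(one_request / GIB)          # 2.0 GiB for one 4096-token request
print(eight_requests / GIB)       # 16.0 GiB for eight such requests
```

For comparison, 7 billion parameters at 16 bits each come to about 13 GiB, so under these assumptions eight concurrent 4096-token requests already need more memory for the cache than for the weights.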

Serving engines work hard to manage it: vLLM's PagedAttention, for instance, stores the KV cache in fixed-size blocks that need not be contiguous in memory (like pages in OS virtual memory), which cuts fragmentation and lets the engine pack in more requests.
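The paging idea can be sketched with a toy allocator. This is not vLLM's code or API: `PagedKVAllocator`, its methods, and the block sizes are hypothetical, and the sketch tracks only which block each token lands in, not the key and value data itself.

```python
class PagedKVAllocator:
    """Toy block allocator: each request maps to a list of physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            # Every block this request owns is full: take any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)

# Two requests generate tokens in an interleaved order.
slots = [alloc.append_token(r) for r in ["a", "b", "a", "a"]]

table_a = list(alloc.block_tables["a"])  # blocks owned by "a"
table_b = list(alloc.block_tables["b"])  # blocks owned by "b"

alloc.free("a")
free_after = sorted(alloc.free_blocks)
```

Request "a" ends up with two blocks that are not neighbours, because "b" claimed a block in between. Memory is reserved one block at a time as a request grows, so the most a request can waste is the unused tail of its last block, and freed blocks are immediately reusable by any other request.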

Related

Context window · VRAM · vLLM · Inference


© 2026 Cynosure LLC.