
KV cache

A per-request memory buffer that stores the attention keys and values for tokens already processed, so the model doesn't recompute them for every new token — it speeds up generation but consumes VRAM that grows with context length.

The KV cache is one of those behind-the-scenes mechanisms that quietly explains a lot of what you observe when running models: why generation is reasonably fast, why long contexts eat so much memory, and why serving many users at once is so VRAM-hungry. "KV" stands for keys and values, two of the three vectors (query, key, value) that the attention mechanism computes for each token.

Here's the problem it solves. A language model generates text one token at a time, and each new token attends back over all the previous tokens to decide what comes next. Naively, producing token number 500 would mean re-processing all 499 tokens before it, token 501 would re-process 500, and so on. That's an enormous amount of repeated work that would make generation painfully slow, growing worse with every token. The KV cache fixes this by storing the key and value vectors for every token the model has already processed. When a new token is generated, the model only needs to compute that token's query, key, and value vectors and attend against the cached keys and values. The projection work is constant per token, and the attention step grows only linearly with the number of cached tokens, instead of the whole prefix being recomputed at every step. This caching is the main reason autoregressive generation is fast enough to be practical.
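To make the saving concrete, here is a minimal sketch of a single attention head in plain Python, comparing a step that recomputes keys and values for the whole prefix with one that reuses a cache. The dimensions, random weights, and helper names (`matvec`, `attend`, `step_with_cache`) are assumptions made for illustration; a real model does this per layer and per head on a GPU.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def step_without_cache(tokens, w_q, w_k, w_v):
    """Recompute keys and values for every token, then attend from the last."""
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    return attend(matvec(w_q, tokens[-1]), keys, values)

def step_with_cache(new_token, cache, w_q, w_k, w_v):
    """Project only the new token, append to the cache, then attend."""
    cache["keys"].append(matvec(w_k, new_token))
    cache["values"].append(matvec(w_v, new_token))
    return attend(matvec(w_q, new_token), cache["keys"], cache["values"])

def rand_matrix(dim):
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

random.seed(0)
DIM = 4  # toy embedding size, chosen only for illustration
w_q, w_k, w_v = rand_matrix(DIM), rand_matrix(DIM), rand_matrix(DIM)
tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]

cache = {"keys": [], "values": []}
for i, token in enumerate(tokens):
    cached = step_with_cache(token, cache, w_q, w_k, w_v)
    naive = step_without_cache(tokens[: i + 1], w_q, w_k, w_v)
    # Both paths produce the same attention output for the newest token.
    assert all(math.isclose(a, b) for a, b in zip(cached, naive))
```

The cached path performs one key projection and one value projection per step, while the naive path performs one of each for every token in the prefix, which is where the repeated work comes from.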

The cost of that speed is memory, and it's substantial. The KV cache lives in VRAM right alongside the model's weights, and its size grows linearly with two things: the number of tokens in context, and the number of requests running in parallel. Every token you add to the context adds its keys and values to the cache. Feed a model a long document or a long chat history and the KV cache grows large; run several requests at once (as any real service does) and it multiplies. With long contexts or many concurrent users, the KV cache can end up using as much VRAM as the model itself, or even more. This is the concrete reason that a model which fits comfortably at a short context can run out of memory at a very long one, and why long-context serving is genuinely expensive in hardware terms. When you plan VRAM for a local model, you need headroom for the KV cache on top of the weights.
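A rough way to estimate that growth is to count the stored values directly: one key and one value vector per token, per layer, per key/value head. The sketch below assumes 16-bit cache entries, and the configuration numbers (32 layers, 32 key/value heads, head dimension 128) are illustrative rather than the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   concurrent_requests=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 covers one key vector plus one value vector
    stored per token, per layer, per key/value head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_requests

GIB = 1024 ** 3

# Hypothetical configuration, chosen only for illustration.
single_request = kv_cache_bytes(32, 32, 128, context_tokens=4096)
eight_requests = kv_cache_bytes(32, 32, 128, context_tokens=4096,
                                concurrent_requests=8)

print(single_request / GIB)   # 2.0
print(eight_requests / GIB)   # 16.0
```

For scale, 7 billion parameters at 2 bytes each come to about 13 GiB of weights, so under these assumptions eight concurrent 4096-token requests would already hold more cache than that.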

Because the KV cache is such a large and dynamic consumer of memory, managing it well is one of the main jobs of a serious serving engine. vLLM's signature technique, PagedAttention, is a good example: instead of reserving one big contiguous block of memory per request (which wastes a lot when requests are different lengths), it stores the KV cache in small fixed-size blocks that need not be contiguous, much like an operating system's virtual memory pages. This slashes wasted memory and lets the engine pack far more concurrent requests onto the same GPU. Other approaches reduce the cache's size at the source. Attention variants like grouped-query attention, for example, let groups of query heads share a single set of keys and values, so there's simply less to cache. That is one reason modern models can offer longer context windows without their memory needs exploding.
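The effect of block-based allocation can be seen with simple bookkeeping. The sketch below is an accounting illustration only, not vLLM's implementation: the request lengths, the 4096-token maximum context, and the 16-token block size are all assumed values.

```python
import math

def contiguous_slots(request_lengths, max_context):
    """Token slots reserved when each request gets one max-length region."""
    return max_context * len(request_lengths)

def paged_slots(request_lengths, block_size):
    """Token slots reserved when each request gets just enough fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)

# Hypothetical mix of short and long requests (tokens currently in context).
lengths = [100, 900, 2500, 40]
used = sum(lengths)

contiguous = contiguous_slots(lengths, max_context=4096)
paged = paged_slots(lengths, block_size=16)

print(used)               # 3540
print(contiguous - used)  # 12844 slots reserved but unused
print(paged - used)       # 44 slots reserved but unused
```

With paging, the only waste is the partly filled last block of each request, so it is bounded by one block per request regardless of how much the request lengths differ.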

The two practical levers you control are context length and concurrency: shorter contexts and fewer simultaneous requests mean a smaller KV cache and lower VRAM use, while long contexts and heavy parallelism mean the opposite. Spanvero's VRAM-to-run estimates account for the KV cache at a realistic context — not just the raw weights — so the numbers reflect what a model actually needs to run usefully. To see how your chosen context length changes the memory a specific model needs, and the honest cost of running it locally, on a rented GPU, or via your own API key, use the calculator at /calculator/.

Context window · VRAM · vLLM · Inference · Tokens · Local vs API vs renting a GPU
