VRAM

The dedicated memory on your GPU; a model's weights (plus its KV cache) must fit in VRAM to run fast, making it the main hardware limit for local AI.

VRAM is the fast memory attached to a graphics card. To run a model quickly, its weights need to fit in VRAM; if they don't, you either offload part of it to system RAM (much slower) or can't run it at all. This is why VRAM, not raw GPU speed, is usually the deciding factor for what you can run locally.

Memory needed depends on parameter count and quantization. As a rough guide: ~2 GB per billion parameters at 16-bit, ~1 GB/B at 8-bit, and ~0.5 GB/B at 4-bit — so a 7B model in Q4 needs roughly 4-5 GB of weights. You also need headroom for the KV cache, which grows with context length and how many requests run at once.

Spanvero computes an estimated VRAM-to-run for each model at its default quant, so you can objectively filter to "models that fit my card" — this is a measurable, transparent criterion, not a quality judgment.

Related

Quantization · Parameters (the "B" / billions) · KV cache · Local vs API vs renting a GPU

All explainers → · Browse models →

Open the free Spanvero advisor → · Honest, $0-markup. © 2026 Cynosure LLC.