Spanvero How it works Find a model Compare models Pricing

VRAM

The dedicated memory on your GPU; a model's weights plus its KV cache must fit in VRAM to run fast, making it the single biggest hardware limit for local AI.

VRAM is the fast memory attached to your graphics card, and for local AI it is the number that matters most. To run a model quickly, the model's weights need to fit in VRAM. If they don't fit, you have two options, both worse: offload part of the model to your system RAM (which is far slower, so generation crawls), or you simply can't run it. This is why VRAM — not raw GPU speed, not the number of cores — is usually the deciding factor in what you can run at home. A modest card with enough VRAM will happily run a model that a faster card with too little VRAM can't load at all.

How much you need comes down to two things: the model's parameter count and the quantization level you run it at. The rules of thumb are worth memorizing. At full 16-bit precision, budget about 2 GB per billion parameters. At 8-bit, about 1 GB per billion. At 4-bit (the common local default), about 0.5 GB per billion. So a 7B model in 4-bit needs roughly 4-5 GB just for weights; a 13B needs around 8 GB; a 34B around 20 GB; and a 70B around 40 GB. Those figures are why a 24 GB card like an RTX 3090 or 4090 is such a sweet spot — it comfortably holds 32B-class models at 4-bit with room to spare, and can stretch to larger ones.

But weights are only part of the bill. You also need headroom for the KV cache, the per-request memory that stores attention keys and values for the tokens already processed. The KV cache grows with how long your context is and how many requests run at once, and with long contexts it can consume a surprising amount of VRAM — sometimes rivaling the weights themselves. This is why a model that "fits" in weight terms can still run out of memory once you feed it a long document, and why realistic VRAM estimates have to account for the context window, not just the parameters. The runtime itself and CUDA overhead also eat a little.

A practical way to plan: take the weight estimate for your target model and quant, then leave several extra gigabytes of headroom for the KV cache and overhead. If it's tight, you can step down to a smaller quant (see Q4_K_M and quant levels) or shorten your context window to reclaim memory. Bigger models tolerate lower quants better, so on large models you can lean on 4-bit or even 3-bit to make things fit.

A few hardware notes that matter in practice. On Apple Silicon Macs there's no separate VRAM number — the GPU shares the machine's unified memory with the system, so your total RAM is effectively your VRAM budget (minus what the OS needs). This is why a 32 GB or 64 GB Mac can run surprisingly large models locally. On PCs, VRAM is the fixed amount soldered to the graphics card and can't be upgraded, so it's worth checking before you buy: a 24 GB card comfortably fits 32B-class models at 4-bit, a 16 GB card handles up-to-mid-teens-billion models well, and an 8 GB card is best for small-to-7B models. Running a model that spills over your VRAM into system RAM works, but the overflow layers run on the far slower CPU/RAM path, so speed drops sharply — often the difference between a snappy response and a crawl.

It's also worth separating VRAM (the constraint on whether a model runs at all) from GPU speed (which affects how fast it runs once it fits). A newer card with more compute and memory bandwidth will generate faster, but only among models that already fit. That's why, for the question "what can I run?", VRAM is the number to lead with, and speed is a secondary consideration.

This is exactly the calculation Spanvero does for you. We compute an estimated VRAM-to-run for every model at its default quant and a realistic context, so you can objectively filter to "models that actually fit my card" — a measurable, transparent criterion, never a quality judgment. Browse straight to what fits your GPU at /models/8gb-vram/, /models/16gb-vram/, /models/24gb-vram/, or /models/48gb-vram/, see the ranked picks for your budget under /best/best-llm-for-8gb-vram/ and its siblings, or plug in an exact model, quant, and context length at /calculator/ to see the precise memory it needs. If a model is too big for your VRAM, the same page shows you the honest cost of the alternatives — renting a GPU or using your own API key.

Start from the VRAM you actually have

The fastest route to a useful model is to filter by your card, then compare cost and quality within that fit set.

Can I run it? → — Choose your GPU or Mac and get a sourced fit list.
Calculate exact VRAM → — Include quant and context instead of guessing from parameter count.
Rent more VRAM → — Compare the current hourly price of cards larger than yours.

Quantization · Parameters (the "B" / billions) · KV cache · Local vs API vs renting a GPU · Context window · Q4_K_M and quant levels

All explainers → · Browse models →

The weekly price index

A short email of real AI price moves, straight from the daily log — no hype. We're collecting the list now; the first issue goes out when it opens. Unsubscribe with one click.

Joining the list needs JavaScript — or just email support@spanvero.com and we'll add you.

VRAM

Start from the VRAM you actually have

Related

The weekly price index