What LLMs can I run on 12GB of VRAM?

A 12 GB card comfortably runs 7-8B models with plenty of headroom (so you can use higher quants or longer contexts) and reaches models of around 13B at 4-bit, a nice step up from 8 GB.

A 12 GB graphics card, such as an RTX 3060 12 GB, an RTX 4070, or an RX 6700 XT, sits in a useful middle ground between the entry 8 GB tier and the sweet-spot 16-24 GB cards. It's a genuine step up, and knowing what the extra 4 GB buys you helps you pick models that make the most of it.

The sizing math follows the usual rule of thumb: about 0.5 GB per billion parameters at 4-bit for the weights, plus a few gigabytes of headroom for the KV cache, runtime, and OS. At that rate an 8B model needs about 4 GB for weights and a 13B model about 6.5 GB. On 12 GB that means 7-8B models run with comfortable headroom, enough that you can step up to a higher quant (Q5 or Q6 instead of Q4) for a bit more quality, or use a longer context, rather than running right at the edge. You can also reach into the low-teens-billion range (models around 13B) at 4-bit, which starts to include noticeably stronger general models than the 7-8B tier alone.
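The rule of thumb above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not how any particular tool computes fit: the 3 GB headroom figure is an assumed stand-in for "a few gigabytes", the function names are illustrative, and real quantized files often come out somewhat larger than their nominal bit width suggests.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB.

    At 4 bits per weight this gives 0.5 GB per billion parameters,
    matching the rule of thumb.
    """
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 12.0, headroom_gb: float = 3.0) -> bool:
    """Return True if weights plus headroom fit in the given VRAM.

    headroom_gb is an ASSUMED allowance for KV cache, runtime, and OS;
    adjust it for your own context length and setup.
    """
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    # A few illustrative model sizes and nominal bit widths.
    candidates = [
        ("8B at 4-bit", 8, 4),
        ("8B at 6-bit", 8, 6),
        ("13B at 4-bit", 13, 4),
        ("32B at 4-bit", 32, 4),
    ]
    for label, params, bits in candidates:
        size = weight_gb(params, bits)
        verdict = "fits" if fits(params, bits) else "does not fit"
        print(f"{label}: ~{size:.1f} GB weights, {verdict} in 12 GB")
```

With the assumed 3 GB of headroom, the 8B and 13B cases fit in 12 GB while the 32B case does not, which lines up with the tiers described above.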

So the practical picture on 12 GB: 7-8B models become roomy rather than tight, giving you flexibility on quant and context; 13B-class models come into range at 4-bit; and small models (under 4B) run with tons of headroom for fast, lightweight use. That flexibility is the real value of 12 GB — not a dramatically bigger maximum model than 8 GB, but the ability to run the same useful sizes with better quality settings and longer contexts, plus a reach into the low teens.

The honest limits: 12 GB still won't comfortably hold mid-size models much above 13B at a useful quant, and it's well short of the 24 GB you'd want for 32B-class models. The KV cache grows in proportion to context length, so a 13B model with a very long context can push against 12 GB. The levers are the same as always: a smaller quant, a shorter context, or a GPU/CPU split for overflow. Models bigger than the low teens mean stepping up to more VRAM, renting a GPU, or using an API.
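To see why a long context can push a 13B model against 12 GB, here is a minimal sketch of the standard KV cache size estimate: keys and values are stored for every layer, with one vector per KV head per token. The model dimensions below (40 layers, 40 KV heads, head dimension 128, 16-bit cache entries) are assumed for illustration, roughly the shape of a 13B-class model without grouped-query attention. Check a real model's config for its actual values.

```python
def kv_cache_gib(context_tokens: int,
                 n_layers: int = 40,        # ASSUMED: illustrative 13B-class shape
                 n_kv_heads: int = 40,      # ASSUMED: no grouped-query attention
                 head_dim: int = 128,       # ASSUMED
                 bytes_per_value: int = 2   # ASSUMED: 16-bit cache entries
                 ) -> float:
    """Estimate KV cache size in GiB for a given context length.

    The leading factor of 2 accounts for storing both keys and values.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


if __name__ == "__main__":
    for ctx in (2048, 4096, 16384):
        print(f"{ctx:>6} tokens: ~{kv_cache_gib(ctx):.2f} GiB of KV cache")
```

With these assumed dimensions, a 4,096-token context needs about 3.1 GiB of cache and a 16,384-token context about 12.5 GiB, which alone would exceed the card before any weights are loaded. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache, so the same context costs less.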

A useful way to think about 12 GB versus 8 GB: the jump mostly buys comfort and quality at the sizes you'd already run, plus a modest reach upward, rather than unlocking a whole new class of model. If your goal is specifically to run 32B-class models, 12 GB won't get you there and you'd want a 24 GB card. But for running 7-13B models well — with room for higher quants and longer contexts — 12 GB is a comfortable, cost-effective choice.

The cost story is appealing: once you own the card, a 12 GB GPU runs these models for effectively $0 in compute beyond electricity, fully private and offline. For personal use of 7-13B models, that's very cheap. Larger ambitions point to more VRAM, a rented GPU, or a pay-per-token API with your own key, depending on how much you'll use them.

Spanvero computes what fits in 12 GB objectively, at each model's default quant and a realistic context. See the ranked list of models that fit a 12 GB card at /best/best-llm-for-12gb-vram/, browse the neighboring tiers at /models/8gb-vram/ and /models/16gb-vram/, and use /calculator/ to check a specific model, quant, and context against 12 GB and see its honest local cost. If you're deciding between card sizes, compare the tiers in the guides at /learn/what-can-i-run-on-8gb-vram/ and /learn/what-can-i-run-on-16gb-vram/.

