What LLMs can I run on 8GB of VRAM?

On 8 GB you can comfortably run models up to about 7-8B parameters at 4-bit, with headroom left for context. That covers a lot of genuinely useful chat, coding, and writing models, so 8 GB is a solid entry point for local AI.

An 8 GB graphics card, such as an RTX 3060 Ti, an RTX 4060, or an 8 GB laptop GPU, is where many people start with local AI, and it runs a lot of genuinely useful models. The key is knowing what fits, and why, so you pick models that leave room to actually work.

The sizing math: at the common 4-bit quant, budget about 0.5 GB per billion parameters for the weights, plus a few gigabytes of headroom for the KV cache (the per-request memory that grows with your context length) and for the runtime and your OS. By that rule of thumb a 7B model needs about 3.5 GB for weights and an 8B model about 4 GB. On an 8 GB card the headroom is meaningful, so the practical ceiling isn't 8 GB of weights but something that leaves a few gigabytes to spare. That puts small models (under 4B) firmly in comfortable territory and 7-8B models within reach at 4-bit, which is the workhorse range most home users run.
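The rule of thumb above can be written down as a quick check. This is a minimal sketch, not a precise calculator: the function names are hypothetical, and the 2.5 GB headroom figure is an assumption standing in for "a few gigabytes". Real quant formats also carry some overhead beyond a flat 4 bits per weight, so treat the results as rough.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight memory in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float,
         vram_gb: float = 8.0,
         headroom_gb: float = 2.5,      # assumed allowance for KV cache, runtime, OS
         bits_per_weight: float = 4.0) -> bool:
    """True if the weights plus the assumed headroom fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for size in (3, 7, 8, 14):
        print(f"{size}B at 4-bit: ~{weights_gb(size):.1f} GB weights, "
              f"fits 8 GB with headroom: {fits(size)}")
```

With these assumptions, 3B, 7B, and 8B models pass the check and a 14B model does not, which matches the ranges described above.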

What that gets you is more than it sounds. The 7-8B tier at 4-bit includes capable general chat models, solid coding models, and strong writing and summarization models — enough for real daily use, not just toys. Small models under 4B are fast and leave lots of headroom, good for quick tasks, on-device use, and running alongside other applications. So an 8 GB card is not a consolation prize; it's a legitimate local-AI setup for the model sizes most people actually want.

The honest limits: 8 GB won't hold mid-size models (14B and up) at a useful quant without spilling into slower system RAM, and it can't run large or flagship models. If you feed a 7-8B model a very long context, the growing KV cache can push you over 8 GB, so long-context work on 8 GB means keeping the context reasonable or dropping to a smaller model. Stepping down a quant also frees memory, but you'll generally want to stay at Q4 or above on these smaller models, since low-bit quantization bites harder on small models than large ones.
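To see how context length eats into an 8 GB budget, here is a rough sketch of the usual KV-cache size estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape used below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an illustrative assumption for an 8B-class model with grouped-query attention, not a measurement of any specific model, and the function name is hypothetical.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB for one sequence.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed 8B-class shape: 32 layers, 8 KV heads, head_dim 128.
    for ctx in (2048, 8192, 32768):
        print(f"context {ctx:>6}: ~{kv_cache_gib(32, 8, 128, ctx):.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context: about 1 GiB at 8,192 tokens and about 4 GiB at 32,768 tokens. Added to roughly 4 GB of 4-bit weights, the longer context alone would fill the card before the runtime and OS take their share.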

A few levers if you're close to the edge: choose a slightly smaller model or a smaller quant to fit, shorten your context window to reclaim KV-cache memory, or use llama.cpp's GPU/CPU split to run a model slightly too big — it just runs slower for the overflow layers. The friendly runners (Ollama, LM Studio) make these adjustments easy and show you the memory impact.
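For the GPU/CPU split, a back-of-envelope estimate of how many layers can stay on the GPU looks like the sketch below. It assumes the weights are spread evenly across layers and ignores embeddings and the output layer, so it is only a starting point; the function name and the example figures (40 layers, 5.5 GB of usable VRAM) are hypothetical. In llama.cpp the resulting number is what you would try passing to the `--n-gpu-layers` option, then adjust based on actual memory use.

```python
def gpu_layers_that_fit(weights_gb: float, n_layers: int,
                        vram_budget_gb: float) -> int:
    """Estimate how many layers fit in the VRAM left over for weights.

    Assumes every layer is the same size (weights_gb / n_layers).
    """
    per_layer_gb = weights_gb / n_layers
    layers = int(vram_budget_gb // per_layer_gb)
    # Clamp to the valid range: never negative, never more than the model has.
    return max(0, min(n_layers, layers))


if __name__ == "__main__":
    # Assumed example: a 14B model at 4-bit (~7 GB of weights) with 40 layers,
    # and 5.5 GB of VRAM left for weights after headroom.
    print(gpu_layers_that_fit(7.0, 40, 5.5))
```

In this example roughly three quarters of the layers stay on the GPU and the rest run from system RAM, which is why the model still works but generates more slowly.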

The cost angle is the appealing part: an 8 GB card runs these models for effectively $0 in compute — just electricity — and everything stays private and offline. For light-to-moderate personal use of small and 7-8B models, that's about as cheap as AI gets. If you later want bigger models, the honest options are a card with more VRAM, a rented GPU, or a pay-per-token API with your own key, depending on your usage.

Spanvero makes "what fits 8 GB" an objective, computed answer rather than a guess. It calculates the VRAM-to-run for every model at its default quant and a realistic context, so you can filter straight to what your 8 GB card can hold. See the full list of models that fit, ranked by how much model you get per gigabyte, at /models/8gb-vram/ and the ranked picks at /best/best-llm-for-8gb-vram/; browse the smallest, fastest options at /best/best-small-llms/; and use /calculator/ to check a specific model and context against 8 GB and see its honest local cost. If you're weighing a card upgrade, compare against the next tiers in the guides at /learn/what-can-i-run-on-12gb-vram/ and /learn/what-can-i-run-on-16gb-vram/.


© 2026 Cynosure LLC.
