How much RAM vs VRAM do I need for LLMs?

VRAM (GPU memory) is what a model needs to run fast, and it's usually the binding limit; system RAM matters mainly for CPU-only running and as slower overflow — except on Apple Silicon Macs, where unified memory means RAM is your VRAM.

"How much RAM vs VRAM do I need?" trips people up because the two kinds of memory play very different roles for local AI, and which one matters depends on your hardware. Getting the distinction straight saves you from buying the wrong thing.

VRAM is the fast memory attached to your graphics card, and for running models on a PC with a GPU, it's the number that matters most. To run a model quickly, its weights (plus the KV cache) need to fit in VRAM so the whole thing runs on the GPU. If it fits, you get full GPU speed; if it doesn't, you either offload part to slower system RAM (and generation slows sharply) or can't run it at all. So on a GPU-equipped PC, VRAM is the binding constraint on what you can run and how fast: a modest card with plenty of VRAM beats a fast card with too little. Size your VRAM to the models you want. Weights take about 0.5 GB per billion parameters at 4-bit, and the KV cache and runtime overhead come on top of that, so an 8 GB card handles models up to about 7B, 16 GB handles up to the mid-teens-billion range, and 24 GB fits 32B-class models (roughly 16 GB of weights) with room left for context.
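The sizing rule above is simple enough to sketch in a few lines of Python. This is a rough estimate, not a measurement: the function names are made up for illustration, and the `overhead_gb` default (a stand-in for KV cache and runtime buffers) is an assumed placeholder that in practice varies with context length and runtime.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(
    params_billion: float,
    vram_gb: float,
    bits_per_weight: float = 4.0,
    overhead_gb: float = 1.5,  # assumption: KV cache + runtime buffers
) -> bool:
    """True if the estimated weights plus the assumed overhead fit in VRAM."""
    needed_gb = estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed_gb <= vram_gb


if __name__ == "__main__":
    for size_b, card_gb in [(7, 8), (14, 16), (32, 24), (32, 16)]:
        weights = estimate_weights_gb(size_b)
        verdict = "fits" if fits_in_vram(size_b, card_gb) else "does not fit"
        print(f"{size_b}B at 4-bit: ~{weights:.1f} GB of weights, {verdict} in {card_gb} GB")
```

Raising `overhead_gb` is the easiest way to model longer contexts, since the KV cache grows with context length.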

System RAM is the general memory your CPU uses. For local AI on a GPU-equipped PC, RAM plays a supporting role: you want enough to load the model file and run the OS and apps, but the model runs from VRAM, so piling on system RAM doesn't let you run bigger models fast. Where RAM becomes the main event is CPU-only inference: if you have no dedicated GPU, the model runs from system RAM on the CPU, so your RAM is effectively your model-size budget (and generation is much slower than on a GPU). RAM also acts as overflow: llama.cpp can split a model between GPU and CPU, keeping as many layers as fit in VRAM and running the rest from RAM, so ample RAM lets you run a model slightly too big for your card, with generation slowing as more layers run on the CPU.
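To make the overflow idea concrete, here is a minimal sketch of how a GPU/CPU layer split might be estimated. It assumes every layer is the same size and reserves a fixed slice of VRAM for the KV cache and buffers; both are simplifications, and `split_layers` and `reserve_gb` are hypothetical names, not part of llama.cpp. In llama.cpp itself the number of layers placed on the GPU is chosen with the `-ngl` / `--n-gpu-layers` option.

```python
def split_layers(
    model_gb: float,
    n_layers: int,
    vram_gb: float,
    reserve_gb: float = 1.5,  # assumption: VRAM held back for KV cache and buffers
) -> tuple[int, int, float]:
    """Return (gpu_layers, cpu_layers, ram_needed_gb) for a partial offload.

    Assumes all layers are equally sized, which is only approximately true.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Keep as many whole layers on the GPU as fit in the usable VRAM.
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    cpu_layers = n_layers - gpu_layers
    return gpu_layers, cpu_layers, cpu_layers * per_layer_gb


if __name__ == "__main__":
    gpu, cpu, ram = split_layers(model_gb=20.0, n_layers=64, vram_gb=16.0)
    print(f"{gpu} layers on GPU, {cpu} layers on CPU, ~{ram:.1f} GB of model in RAM")
```

The more layers land in the CPU column, the slower generation gets, which is why overflow works best for models only slightly larger than the card.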

Apple Silicon Macs change the whole picture with unified memory. There's no separate VRAM number; the CPU and GPU share one memory pool, so your total system RAM is effectively your VRAM budget (minus what macOS needs). This is why the RAM-vs-VRAM distinction collapses on a Mac — on an M-series machine, "how much RAM" is the only question, and a 32 GB or 64 GB Mac can run models that would need an expensive dedicated GPU on a PC. When people ask about RAM vs VRAM for a Mac, the answer is simply: buy as much unified memory as you can, because it's both.

So the practical guidance splits by hardware. On a PC with a GPU: prioritize VRAM to fit the models you want, and have enough system RAM (16-32 GB is a comfortable baseline) to load files and allow some overflow, but don't expect extra RAM to substitute for VRAM. On a PC without a GPU: your system RAM is your model budget, so more RAM lets you run bigger models on the CPU, accepting slower speeds. On an Apple Silicon Mac: unified memory is everything — more of it means bigger, faster models, with no separate VRAM to think about.
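The three cases above boil down to picking which memory pool counts as your model budget. A minimal sketch, assuming a flat `os_reserve_gb` allowance for the operating system and apps (an illustrative number, not a measured one) and hypothetical names throughout:

```python
from enum import Enum


class Hardware(Enum):
    PC_WITH_GPU = "pc_with_gpu"
    PC_CPU_ONLY = "pc_cpu_only"
    APPLE_SILICON = "apple_silicon"


def model_budget_gb(
    hardware: Hardware,
    ram_gb: float,
    vram_gb: float = 0.0,
    os_reserve_gb: float = 8.0,  # assumption: memory left for the OS and apps
) -> float:
    """Memory a model can use at full speed on the given kind of machine."""
    if hardware is Hardware.PC_WITH_GPU:
        # Discrete GPU: VRAM is the hard limit; system RAM is only slow overflow.
        return vram_gb
    # CPU-only PC or Apple Silicon: the model lives in system (or unified) memory.
    return max(ram_gb - os_reserve_gb, 0.0)


if __name__ == "__main__":
    print(model_budget_gb(Hardware.PC_WITH_GPU, ram_gb=32, vram_gb=24))
    print(model_budget_gb(Hardware.PC_CPU_ONLY, ram_gb=16))
    print(model_budget_gb(Hardware.APPLE_SILICON, ram_gb=64))
```

The budget is the same kind of number in all three cases, but speed is not: a CPU-only PC runs from its budget far more slowly than a GPU or an Apple Silicon Mac does.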

A common buying mistake is loading up on cheap system RAM expecting it to run big models fast on a GPU rig — it won't, because the GPU can only run fast from VRAM. The reverse mistake is ignoring RAM entirely on a CPU-only or overflow setup, where it's the whole budget. Match the memory type to how you'll actually run: GPU means VRAM-first, CPU-only means RAM-first, and Mac means unified-memory-first.

Spanvero computes what fits your specific memory situation — treating a Mac's unified RAM as a VRAM budget and a discrete card's VRAM as the hard limit — so you can filter models to what your machine can actually run. Use /calculator/ to enter your VRAM (or Mac RAM) and a model to see whether it fits and how; browse by budget at /models/8gb-vram/, /models/16gb-vram/, and /models/24gb-vram/; and read how a Mac's memory works in the guide at /learn/run-llama-on-macbook/.

