Roughly 40 GB at the common 4-bit quant (about 140 GB at full 16-bit), plus several GB of headroom for the KV cache — so a 70B is a two-24GB-card, workstation, or big-Mac job, not a single-consumer-GPU one.
This is one of the most-searched sizing questions, and the honest answer starts with a simple rule of thumb: a model's weights need about 2 GB of memory per billion parameters at full 16-bit precision, about 1 GB per billion at 8-bit, and about 0.5 GB per billion at a nominal 4-bit. Plug in 70 billion and you get roughly 140 GB at 16-bit, about 70 GB at 8-bit, and about 35 GB at a pure 4 bits per weight. Common 4-bit quant formats store scaling data alongside the weights and keep some tensors at higher precision, so they average somewhat more than 4 bits per weight, which is why real 4-bit 70B files come to roughly 40 GB. Since almost nobody runs a 70B at full precision locally, the number that matters in practice is that 4-bit figure: around 40 GB just for the weights.
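The rule of thumb above is just parameters times bytes per weight. A minimal sketch in Python, assuming 1 GB = 10^9 bytes; the function name and the 4.6 bits-per-weight value are illustrative assumptions (real 4-bit quant formats vary in their effective size), not figures from any particular model file:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    cases = [
        ("16-bit", 16),
        ("8-bit", 8),
        ("4-bit, nominal", 4),
        ("4-bit quant, assumed ~4.6 bits/weight", 4.6),  # assumption for illustration
    ]
    for label, bits in cases:
        print(f"70B at {label}: {weight_memory_gb(70, bits):.1f} GB")
```

The nominal 4-bit case comes out at 35 GB; the roughly 40 GB planning figure corresponds to an effective size a little above 4 bits per weight.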
But weights are not the whole bill. You also need headroom for the KV cache — the per-request memory that holds the attention keys and values for every token in your context. On a 70B with a normal context that's several extra gigabytes, and with a very long context it can climb higher. The practical planning figure is therefore "about 40 GB for weights at 4-bit, plus several GB of headroom," which lands most real 70B setups in the mid-40s of gigabytes.
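To see where "several extra gigabytes" comes from, here is a rough KV-cache estimate. It is a sketch under stated assumptions: the defaults (80 layers, 8 key/value heads, head dimension 128, 16-bit cache values) match a Llama-style 70B with grouped-query attention, and other 70B models may differ. It also ignores compute buffers and other runtime overhead.

```python
def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 80,        # assumed: Llama-style 70B
    n_kv_heads: int = 8,       # assumed: grouped-query attention
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # assumed: 16-bit cache entries
) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes) for one request."""
    # Each token stores one key vector and one value vector per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    for ctx in (4096, 8192, 32768):
        print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.2f} GB")
```

Under these assumptions an 8,192-token context costs about 2.7 GB and a 32,768-token context about 10.7 GB, which is why context length is worth watching when memory is tight.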
That immediately tells you what hardware is required. A 70B at 4-bit does not fit on a single 24 GB consumer card like an RTX 3090 or 4090; you're roughly 20 GB short. The realistic options are: two 24 GB cards together (giving 48 GB, which covers the mid-40s figure with a few GB to spare), a single 48 GB workstation card like an RTX A6000, an Apple Silicon Mac with 48 GB or more of unified memory (where system RAM doubles as VRAM, though the operating system and other apps need a share of it too, so more than 48 GB is the safer pick), or a rented cloud GPU such as an 80 GB H100 or A100. This is exactly why the 70B tier sits at the boundary between "serious local rig" and "rent a GPU."
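The fit check is simple subtraction. A small sketch, assuming a planning figure of 45 GB (40 GB of weights plus 5 GB of headroom, where the 5 GB is an illustrative stand-in for "several GB"); the option labels and function name are hypothetical:

```python
REQUIRED_GB = 40 + 5  # assumed: 4-bit weights + illustrative KV-cache headroom

# Memory capacities named in the text, in GB.
OPTIONS = {
    "1x 24 GB consumer card": 24,
    "2x 24 GB consumer cards": 48,
    "1x 48 GB workstation card": 48,
    "1x 80 GB cloud GPU": 80,
}


def headroom_gb(capacity_gb: float, required_gb: float = REQUIRED_GB) -> float:
    """Positive means the model fits with that much to spare; negative means short."""
    return capacity_gb - required_gb


if __name__ == "__main__":
    for name, capacity in OPTIONS.items():
        spare = headroom_gb(capacity)
        verdict = "fits" if spare >= 0 else "does not fit"
        print(f"{name}: {verdict} ({spare:+.0f} GB)")
```

A single 24 GB card comes out about 21 GB short under this assumption, while 48 GB leaves only about 3 GB spare, so a long context can still push a 48 GB setup over the edge.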
There are levers if you're close but not quite there. Dropping to a smaller quant — 3-bit instead of 4-bit — shrinks the weights further, and large models like 70B tolerate aggressive quantization better than small ones because they have more redundancy to spare, so a Q3 70B is a legitimate way to squeeze onto tighter hardware with only a modest quality cost. Shortening your context window reclaims KV-cache memory. And llama.cpp's ability to split a model between GPU and system RAM lets you run a 70B that doesn't fully fit — it just runs slower in proportion to how much spills onto the CPU.
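For the GPU/system-RAM split, the practical question is how many layers to put on the GPU, which llama.cpp exposes as the `--n-gpu-layers` (`-ngl`) option. The sketch below is a rough estimate only: it assumes 80 equally sized layers and a 4 GB reserve for the KV cache and buffers, and the function name is hypothetical. Real layer sizes are not perfectly uniform, so treat the result as a starting point to adjust by trial.

```python
import math


def gpu_layers_that_fit(
    vram_gb: float,
    weights_gb: float,
    n_layers: int = 80,       # assumed: Llama-style 70B
    reserve_gb: float = 4.0,  # assumed: KV cache + buffers kept on the GPU
) -> int:
    """Estimate how many transformer layers fit in VRAM, assuming equal-size layers."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    layers = gpu_layers_that_fit(vram_gb=24, weights_gb=40)
    print(f"Try offloading about {layers} of 80 layers to the GPU")
```

With a 24 GB card and 40 GB of weights this suggests about half the layers on the GPU, with the rest running from system RAM at CPU speed.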
A useful reframe: if a 70B is out of reach for your hardware, a strong 30B-class model at 4-bit fits comfortably on a single 24 GB card and is often good enough that the gap to 70B isn't worth the extra hardware for many tasks. Parameter count drives cost and memory, not quality directly — a well-trained smaller model routinely beats a bigger one on real work, so it's always worth trying the size that fits before assuming you need the bigger one.
The cost picture follows from all this. Locally, a 70B costs only electricity once you own the hardware, but the hardware itself is real money (two high-end cards, a workstation GPU, or a big Mac). Renting a GPU by the hour gives you 70B performance without the up-front purchase and is often the honest cheapest route for occasional use or before you've committed to buying. And for many workloads a pay-per-token API to a hosted 70B, using your own key at the provider's real rate, is cheaper still if your volume is light. Which one wins depends entirely on how much you'll use it.
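The "depends on how much you'll use it" point can be made concrete as a break-even calculation between buying and renting. This is a sketch with made-up placeholder prices, not real quotes; the function name and every number in the usage example are assumptions to be replaced with your own figures.

```python
def break_even_hours(
    hardware_cost: float,
    rental_per_hour: float,
    electricity_per_hour: float,
) -> float:
    """Hours of use after which owning the hardware becomes cheaper than renting."""
    saving_per_hour = rental_per_hour - electricity_per_hour
    if saving_per_hour <= 0:
        # Renting is never more expensive per hour, so buying never pays off.
        return float("inf")
    return hardware_cost / saving_per_hour


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    hours = break_even_hours(
        hardware_cost=3000.0,
        rental_per_hour=2.00,
        electricity_per_hour=0.10,
    )
    print(f"Owning wins after roughly {hours:.0f} hours of use")
```

If your expected usage is well below the break-even figure, renting or a pay-per-token API is the cheaper route; well above it, owning the hardware is.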
Spanvero computes the VRAM-to-run for every 70B-class model at its default quant and realistic context, so you can see the exact figure rather than the rule-of-thumb estimate, and then compare the honest, $0-markup cost of running it locally, on a rented GPU, or via your own API key. Plug a specific 70B and context into /calculator/ to see its real memory bill, browse what fits a two-card or workstation setup at /models/48gb-vram/, or see the cheapest route to a big model on the guide at /learn/cheapest-way-to-run-a-70b/.
© 2026 Cynosure LLC.