How big is a 7B / 70B model download?

It depends on the quant: a 7B is about 4-5 GB at 4-bit (14 GB at full precision), and a 70B is about 40 GB at 4-bit (140 GB+ at full) — the download size closely tracks the memory the model needs to run.

Before downloading a model, it's smart to know how many gigabytes you're committing to — both for your disk space and your bandwidth. The good news is the size is easy to estimate, and it closely tracks the memory the model will need to run, so the same rule of thumb answers both questions at once.

The core formula: a model's file size (and roughly its memory footprint) is the parameter count times the bits per weight, divided by eight to convert bits to bytes. In practice this collapses to three handy figures. At full 16-bit precision, budget about 2 GB per billion parameters. At 8-bit, about 1 GB per billion. At 4-bit — the common quant for local use — about 0.5 GB per billion. So the download size depends heavily on which version (which quant) you grab.
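
As a quick sanity check, the formula fits in a few lines of Python. This is a minimal sketch that counts 1 GB as 10^9 bytes and ignores file-format overhead. The function name `model_size_gb` is an illustrative helper, not part of any library.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 10**9 bytes): params x bits / 8."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 1e9


# The three rules of thumb: GB per billion parameters at 16, 8 and 4 bits.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(1, bits):.1f} GB per billion parameters")
```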

For a 7B model: at full 16-bit precision (the safetensors original) it's about 14 GB across its shard files. As a 4-bit GGUF, the version most people download to run locally, it's about 4-5 GB. That is a bit more than the raw 3.5 GB the formula gives, because quantized formats also store per-block scale factors and K-quants such as Q4_K_M keep some tensors at higher precision. At 8-bit it's around 7 GB. So a 7B is a very manageable download, just a few gigabytes in its usual quantized form.
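
The gap between the raw 3.5 GB and the 4-5 GB you actually download comes from the effective bits per weight. The sketch below assumes these values: Q8_0 stores blocks of 32 8-bit weights plus one 16-bit scale (8.5 bits per weight), while the 4.8 figure for Q4_K_M is an assumed rough average, since that quant mixes formats across tensors. It also uses a nominal 7.0 billion parameters; real "7B" models vary a little.

```python
# Assumed effective bits per weight (bpw) for a few GGUF quant levels.
EFFECTIVE_BPW = {
    "F16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,  # 32 x 8-bit weights + one fp16 scale per block = 8.5
    "Q4_K_M": 4.8,               # assumption: rough average for this mixed K-quant
    "raw 4-bit": 4.0,            # the idealized figure the simple formula uses
}


def size_gb(params_billion: float, bpw: float) -> float:
    """Weight size in GB for a parameter count and an effective bits-per-weight."""
    return params_billion * bpw / 8


for name, bpw in EFFECTIVE_BPW.items():
    print(f"7B {name:<9} {bpw:>5.2f} bpw  ~{size_gb(7.0, bpw):.1f} GB")
```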

For a 70B model: at full 16-bit it's over 140 GB, a serious download usually split across many shards. As a 4-bit GGUF it drops to roughly 40 GB, and at 8-bit it's around 70 GB. This is why quantization matters so much for downloads too: at 4 bits versus 16 per weight, the 4-bit version is roughly a quarter of the full-precision size (a little more once format overhead is counted), turning an impractical 140 GB into a large but doable 40 GB.
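
Because size scales linearly with bits per weight, the 4-bit-to-16-bit ratio is the same for any model size. A minimal sketch, bracketing the 4-bit case between the raw 4 bits and an assumed ~4.8 effective bits per weight for a 4-bit K-quant (an approximation, not a spec value):

```python
def size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


for params in (7, 70):
    full = size_gb(params, 16)
    raw_q4 = size_gb(params, 4)    # idealized 4-bit
    k_q4 = size_gb(params, 4.8)    # assumed effective bpw for a 4-bit K-quant
    print(
        f"{params}B: {full:.0f} GB full, {raw_q4:.1f}-{k_q4:.1f} GB at 4-bit "
        f"({raw_q4 / full:.0%}-{k_q4 / full:.0%} of full size)"
    )
```

The percentages come out identical for both sizes, which is why the "about a quarter" rule works as a quick mental shortcut.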

A few practical points. First, a GGUF model repository usually offers many files, one per quant level (Q4_K_M, Q5_K_M, Q8_0, and so on), and you download only the one quant that fits your hardware, not the whole set; file size climbs as the quant number rises. Second, full-precision safetensors releases split the weights across several shard files plus small config and tokenizer files, so the total download is the sum of the shards. Third, very large models (like big MoE flagships) can be huge downloads even when they run fast, because the total parameter count drives the file size no matter how few experts are active per token. Budget disk and bandwidth for the total size.
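
Choosing the one quant to download is a simple "largest file that fits" rule, and a safetensors total is just the sum of its shards. The sketch below uses made-up sizes for a hypothetical 7B repo (`QUANT_FILES_GB` and `shard_sizes_gb` are illustrative, not from a real listing), and `pick_quant` is a hypothetical helper. In practice you would set the budget a bit below your available memory to leave room for the KV cache.

```python
from typing import Optional

# Hypothetical GGUF repo listing: quant level -> file size in GB (illustrative numbers).
QUANT_FILES_GB = {"Q4_K_M": 4.1, "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.2}

# Hypothetical safetensors release: weight shard sizes in GB (config/tokenizer files are tiny).
shard_sizes_gb = [4.98, 4.95, 4.51]


def pick_quant(files_gb: dict[str, float], budget_gb: float) -> Optional[str]:
    """Return the largest quant whose file fits within budget_gb, or None if none fit."""
    fitting = {name: gb for name, gb in files_gb.items() if gb <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(pick_quant(QUANT_FILES_GB, budget_gb=6.0))  # largest file at or under 6 GB: Q6_K
print(f"safetensors total: {sum(shard_sizes_gb):.2f} GB")
```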

File size and memory requirement track each other so closely because the weights make up the bulk of both. The gigabytes on your disk are essentially the same gigabytes that have to fit in VRAM (or system RAM, if you run on CPU) for the model to run. So estimating the download also tells you roughly whether the model will fit your card: a 4-bit 70B that is a 40 GB download needs about 40 GB of memory for its weights alone, which is why it calls for a 48 GB-class setup. What the file size doesn't include is the KV cache, extra memory allocated at runtime for your context, so the running memory need sits somewhat above the download size and grows with context length.
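
To see how much the KV cache adds on top of the download, here is a minimal sketch of the usual estimate: keys and values (the factor of 2) for every layer, KV head, head dimension and token. The architecture numbers are assumptions for a Llama-style 70B with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and a 16-bit cache, and the sketch ignores the smaller scratch buffers runtimes also allocate.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem
    return total_bytes / 1e9


weights_gb = 40.0  # the ~40 GB 4-bit 70B download
for ctx in (4096, 8192, 32768):
    # Assumed Llama-style 70B config with grouped-query attention.
    kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    print(f"{ctx:>6} tokens: weights {weights_gb:.0f} GB + KV {kv:.1f} GB = {weights_gb + kv:.1f} GB")
```

Because the cache grows linearly with context, a long context can add several gigabytes under these assumptions, enough to matter when sizing a card.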

For planning, keep a few disk realities in mind. Models are large, so collecting several to compare adds up quickly, and it's worth pruning ones you're not using to reclaim space. Downloads of tens of gigabytes also take real time on a typical connection, so pick the right quant the first time rather than grabbing several. Beginner-friendly tools help here: LM Studio's model browser, for instance, flags which quantized versions fit your machine before you commit to the download.
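
How long "real time" is depends on your link speed. A rough sketch: convert gigabytes to bits and divide by your connection speed. The `efficiency` parameter, the fraction of the nominal speed you actually get, is an assumption; 0.8 is a placeholder, not a measured value.

```python
def download_minutes(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Minutes to fetch size_gb (10**9 bytes) over a link_mbps (10**6 bits/s) connection."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 60


for size in (4.5, 40, 140):  # 4-bit 7B, 4-bit 70B, full-precision 70B
    for mbps in (100, 1000):
        print(f"{size:>5} GB at {mbps:>4} Mbps: ~{download_minutes(size, mbps):.0f} min")
```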

Spanvero surfaces the parameter count for every model and computes its VRAM-to-run at the default quant, which closely tracks the download size, so you can gauge both before you fetch anything. Browse models sized for your hardware at /models/8gb-vram/ or /models/24gb-vram/, and use /calculator/ to see the memory (and thus roughly the download) a specific model and quant require — plus the honest cost to run it locally, on a rented GPU, or via an API.

Related

Parameters (the "B" / billions) · Quantization · GGUF vs safetensors — which should I download? · VRAM · What quantization should I use? · Mixture of Experts (MoE) · How do I run my first local AI model? · Q4_K_M and quant levels


Honest, $0-markup. © 2026 Cynosure LLC.
