GGUF vs safetensors — which should I download?

Download GGUF if you're running a model locally on ordinary hardware (it's a single pre-quantized file that friendly tools use); download safetensors if you're fine-tuning or serving at scale on GPUs — that's the full-precision original.

When you go to download an open model, you'll usually see it offered in two very different-looking forms: a set of safetensors files, or a folder of GGUF files at various quant levels. Picking the wrong one is a common early mistake, and the rule for choosing is actually simple once you know what each format is for.

Safetensors is the format models are originally published in — it's the canonical, full-precision (16-bit) version of the weights, split across shard files alongside separate config and tokenizer files. It was designed as a safe replacement for the old Python "pickle" checkpoint format, which could execute arbitrary code when loaded; safetensors stores only raw tensor data with no code path, so it's safe to load. It's what training frameworks read and write, and what high-throughput GPU serving engines like vLLM load directly. Because it's full precision, its size follows the roughly-2-GB-per-billion-parameters rule: a 7B is about 14 GB, a 70B well over 100 GB.

GGUF is the format you download to run a model locally with a friendly tool. It's a single self-contained file that bundles the weights, the tokenizer, and all the metadata the runtime needs, and — crucially — it natively supports quantization. A GGUF download is typically already shrunk to a size that runs on consumer hardware: a 7B in a 4-bit GGUF is only about 4-5 GB instead of 14. Ollama, LM Studio, and llama.cpp all consume GGUF, and a single model is usually offered as many GGUF files, one per quant level (Q4_K_M, Q5_K_M, Q8_0, and so on) — you download just the one quant that fits your hardware.

So the decision comes down to what you're doing:

Download GGUF if your goal is to run a model on your own laptop or desktop for personal use. It's the format the easy local tools want, it's pre-quantized so it fits ordinary hardware, and it's a single file with no loose parts to manage. This is the right choice for the vast majority of people just running models. Pick the quant level that fits your VRAM with headroom to spare — Q4_K_M is the usual recommended default.

Download safetensors if your goal is to fine-tune a model, serve it to many users at scale on GPUs with an engine like vLLM, or do anything in a training framework — those tools expect the full-precision original. It's also the format to grab if you want to convert and quantize the model to GGUF yourself from a faithful source, or if you specifically need full 16-bit precision.

A few clarifications that resolve common confusion. The two aren't rivals so much as different links in a chain: a model is trained and published as safetensors, then the community (or the publisher) converts it to GGUF and quantizes it for local use. GGUF also handles the chat template for you — because it bundles the prompt-format metadata, an instruct model behaves like an assistant out of the box — whereas with raw safetensors in a serving engine you may need to apply the template yourself. And on the safety point, safetensors' whole reason for existing is that it can't run code on load; GGUF is likewise just data, so both are safe downloads in that respect.

If you're ever unsure, ask yourself one question: am I running this model for myself, or building/serving/training with it? Running it yourself means GGUF; building, serving at scale, or training means safetensors. That single distinction resolves almost every case.

Spanvero's local-run cost estimates assume the GGUF path, because that's how people actually run models at home for effectively $0 in compute, and its rent-a-GPU serving estimates assume the safetensors-plus-vLLM path. To find models sized for your hardware in their GGUF form, browse /models/8gb-vram/ or /models/16gb-vram/; to compare the honest cost of the local-GGUF route against the rented-GPU serving route for any model, open /calculator/.

Related

GGUF · Safetensors · Quantization · What quantization should I use? · How do I run my first local AI model? · Ollama · vLLM · How big is a 7B / 70B model download?

All explainers → · Browse models →

Open the free Spanvero advisor → · Honest, $0-markup. © 2026 Cynosure LLC.

The weekly price index

A short email of real AI price moves, straight from the daily log — no hype. We're collecting the list now; the first issue goes out when it opens. Unsubscribe with one click.