
GGUF

A single-file model format used by llama.cpp, Ollama, and LM Studio that bundles the (usually quantized) weights plus all metadata, optimized for local CPU/GPU inference.

GGUF (commonly expanded as GPT-Generated Unified Format) is the file format you download when you want to run a model locally with any llama.cpp-based tool. If you've ever grabbed a model to use in Ollama or LM Studio, you were almost certainly downloading a GGUF, even if the tool hid the filename from you. Understanding it clears up a lot of confusion about why models come in different formats and which one you actually want.

GGUF's defining trait is that it is self-contained. A single .gguf file packs the model weights, the tokenizer, and all the metadata the runtime needs (architecture details, prompt template hints, and so on) into one place. There are no loose config files to keep in sync and no separate tokenizer download. Just as importantly, GGUF natively supports quantization — the Q4_K_M, Q5_K_M, Q8_0 variants live inside GGUF — so a GGUF download is typically already shrunk to a size that runs on consumer hardware. When you see a model offered in a dozen quant options, those are usually a dozen GGUF files of the same model at different bit-widths.
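To make "self-contained" concrete, here is a minimal sketch that reads the fixed-size header at the start of a .gguf file. It assumes the little-endian layout used by GGUF version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts); the function name read_gguf_header is made up for this example, and the sketch stops before the metadata key/value pairs themselves.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of every GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumes the GGUF v2+ little-endian header layout; version 1 files
    used 32-bit counts and would need a different format string.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        header = f.read(20)  # uint32 + uint64 + uint64
        if len(header) != 20:
            raise ValueError("truncated GGUF header")
        version, tensor_count, kv_count = struct.unpack("<IQQ", header)
    return version, tensor_count, kv_count


# Usage (the path is a placeholder for a file you have downloaded):
# version, tensors, kv_pairs = read_gguf_header("model-Q4_K_M.gguf")
```

The metadata count is the number of key/value entries (architecture, tokenizer, template hints and so on) that follow the header in the same file, which is why no side files are needed.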

GGUF is built for inference, not training. It's optimized to load fast (including memory-mapped loading) and to run efficiently across a range of hardware — CPU only, GPU, or a split between the two, where some layers run on the GPU and the rest on the CPU. That flexibility is exactly what makes it the backbone of the consumer local-AI ecosystem: Ollama, LM Studio, and countless small apps all consume GGUF because it lets ordinary machines run capable models. GGUF replaced the older GGML format from the same project; if you run into GGML files, they're the previous generation and largely superseded.
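As a rough illustration of the CPU/GPU split, the sketch below estimates how many layers could be placed on the GPU. It is a back-of-the-envelope model, not how any runtime actually decides: it assumes every layer is the same size, treats the file size as the weight size, and sets aside a fixed reserve for context and overhead. The function name and the default reserve are assumptions made for this example.

```python
def layers_on_gpu(file_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many of n_layers fit in VRAM.

    Simplifying assumptions: all layers are equal in size, and
    reserve_gb of VRAM is kept free for context and overhead.
    """
    if n_layers <= 0 or file_size_gb <= 0:
        raise ValueError("file_size_gb and n_layers must be positive")
    per_layer_gb = file_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model has.
    return min(n_layers, int(usable_gb // per_layer_gb))


# An 8 GB file with 32 layers on a 6 GB card, keeping 2 GB free:
print(layers_on_gpu(8.0, 32, 6.0, reserve_gb=2.0))  # prints 16
```

In llama.cpp the resulting number is the kind of value you would pass to its --n-gpu-layers option; friendlier tools usually choose it for you. Real layers are not all the same size and context memory grows with prompt length, so treat the estimate as a starting point.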

It helps to contrast GGUF with the format models are usually published in. New open models are typically released on the Hugging Face Hub as safetensors: the unquantized weights (usually 16-bit), stored in one or more .safetensors files alongside separate config and tokenizer files. Those are what training frameworks and GPU serving engines like vLLM read directly. The common workflow is: a model is trained and published as safetensors, then the community (or the publisher) converts it to GGUF and quantizes it for local use. So the rule of thumb is simple: safetensors for training and high-throughput GPU serving, GGUF for local quantized inference on your own machine.

One practical detail that saves confusion: because a GGUF file bundles the tokenizer and the prompt-format metadata, tools that consume it can apply the correct chat template automatically, so an instruct model behaves like an assistant out of the box (see the base vs instruct explainer for why the template matters). You'll also notice that a single model is usually offered as many GGUF files, one per quant level, and for very large models each quant may itself be split into several shard files. You download just the one quant that fits your hardware (all of its shards, if it is sharded), not the whole set. When a repository lists a folder of files like "model-Q4_K_M.gguf," "model-Q5_K_M.gguf," and "model-Q8_0.gguf," those are the same model at different sizes; pick the largest one that fits in your VRAM (or system RAM, if you run on CPU) with some room left over for context.
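The selection rule can be sketched in a few lines. The file sizes below are invented for illustration (they are not measurements of any real model), and both the helper name pick_quant and the 1.5 GB headroom default are assumptions; the idea is simply "the largest file that still leaves room for context".

```python
def pick_quant(files, budget_gb, headroom_gb=1.5):
    """Return the name of the largest file that fits, or None.

    files: mapping of filename -> size in GB.
    budget_gb: VRAM (or RAM) available for the model.
    headroom_gb: memory kept free for context and overhead (assumed).
    """
    limit = budget_gb - headroom_gb
    fitting = {name: size for name, size in files.items() if size <= limit}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Hypothetical sizes for one model at three quant levels.
example_files = {
    "model-Q4_K_M.gguf": 4.9,
    "model-Q5_K_M.gguf": 5.7,
    "model-Q8_0.gguf": 8.5,
}
print(pick_quant(example_files, budget_gb=8.0))  # prints model-Q5_K_M.gguf
```

With an 8 GB budget and 1.5 GB of headroom the limit is 6.5 GB, so Q5_K_M is chosen and Q8_0 is skipped; a budget that is too small for every file returns None, which is the cue to look for a smaller model.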

GGUF's portability is part of why the local-AI ecosystem converged on it. The same .gguf file runs unchanged whether the underlying engine puts it on an NVIDIA GPU, an AMD GPU, an Apple Silicon Mac's unified memory, or a plain CPU — the runtime handles the hardware differences. That cross-platform consistency is exactly what lets a tool like Ollama or LM Studio offer "one model, runs anywhere" without you having to think about your specific chip.

For practical purposes: if you're running a model on your own laptop or desktop through a friendly tool, you want the GGUF version, and you want to pick a quant level that fits your VRAM (see the guide on Q4_K_M and quant levels for how to choose). Spanvero's local-run cost estimates assume this GGUF path, because it's how people actually run models at home for effectively $0 in compute. To find models sized for your specific hardware, browse /models/8gb-vram/ or /models/16gb-vram/, and to compare the three ways to run any model — local GGUF, rented GPU, or your own API key — open the advisor at /calculator/.

Ready to choose a GGUF model?

Start from your available memory, then open a model page for the exact local, rental and API costs.

Models for 8 GB VRAM: Smaller GGUF models that fit common entry-level cards.
Models for 24 GB VRAM: Larger local models for high-end consumer GPUs.
Calculate a specific model: Use your quant, context and workload instead of a generic rule of thumb.

Safetensors · Q4_K_M and quant levels · llama.cpp · Ollama · LM Studio · Quantization


