
Running open AI models, explained

Plain-English answers to the terms you hit when running models yourself — quantization, GGUF, VRAM, context windows and more.

Core concepts

Active vs total parameters — In an MoE model, total parameters is everything that must be loaded into memory, while active parameters is the smaller subset actually used per token — the former drives memory, the latter drives speed and cost.
Base vs instruct model — A base model is the raw pretrained text-completion model; an instruct (or chat) model has been further tuned to follow instructions and hold a conversation.
Context window — The maximum number of tokens (prompt plus generated output) a model can consider at once; anything beyond it is cut off or forgotten.
Diffusion model — The dominant architecture for AI image (and video) generation: it learns to turn random noise into a coherent image by removing noise step by step, guided by your prompt.
Embeddings — Numeric vectors that represent the meaning of text (or images), so that similar content sits close together — the backbone of semantic search and Retrieval-Augmented Generation (RAG).
Fine-tuning — Continuing to train an existing model on your own data so it adapts to a specific task, domain, style, or format — as opposed to just prompting an unchanged model.
GGUF — A single-file model format used by llama.cpp, Ollama, and LM Studio that bundles the (usually quantized) weights plus all metadata, optimized for local CPU/GPU inference.
Inference — Actually running a trained model to get outputs — generating text, an image, or a transcription — as opposed to training it.
KV cache — A per-request memory buffer that stores the attention keys and values for tokens already processed, so the model doesn't recompute them for every new token — it speeds up generation but consumes VRAM that grows with context length.
llama.cpp — The open-source C/C++ inference engine that runs GGUF models efficiently on CPUs and consumer GPUs; it's the foundation under Ollama, LM Studio, and many local apps.
LM Studio — A free desktop app with a graphical interface for discovering, downloading, and chatting with local models, and serving them via a local API — a GUI-first alternative to command-line tools.
Local vs API vs renting a GPU — The three ways to actually run an open model: on your own hardware (local, with no per-use compute fee once you own the machine), through a hosted pay-per-token API, or by renting a cloud GPU and serving it yourself. Each is cheapest in different situations.
LoRA — A cheap fine-tuning method that freezes the base model and trains tiny add-on "adapter" matrices, producing a small file you can stack on top of the original weights.
Mixture of Experts (MoE) — An architecture that contains many "expert" sub-networks but routes each token through only a few of them, so a model can have huge total size while doing the compute of a much smaller one.
Ollama — A simple, one-command tool for downloading and running open models locally; it wraps llama.cpp and serves a local API, prioritizing ease of use above all.
Parameters (the "B" / billions) — A model's learned numerical weights; the "B" in a name like "7B" or "70B" means billions of them, and the count is the single biggest driver of how large and capable a model is and how expensive it is to run.
Q4_K_M and quant levels — A naming scheme for llama.cpp/GGUF quantized models — Q4_K_M means 4-bit, "K-quant" method, Medium size — and it is the most commonly recommended balance of size and quality for local use.
Quantization — Storing a model's weights at lower numeric precision (e.g. 4-bit instead of 16-bit), which shrinks the download and the memory the model needs to run.
Safetensors — Hugging Face's standard, safe-to-load weight format that stores raw model tensors without executable code, used as the canonical full-precision distribution format for open models.
Text-to-image — Generating an image from a written description (a prompt); today this is almost always done with diffusion models.
Tokens — The chunks of text (roughly word-pieces) that a model reads and writes; pricing, speed, and context limits are all measured in tokens, not words.
TTS / ASR (text-to-speech & speech recognition) — TTS (text-to-speech) turns written text into spoken audio; ASR (automatic speech recognition) does the reverse, transcribing speech into text.
vLLM — A high-throughput, GPU-focused serving engine for LLMs, designed to serve many concurrent requests efficiently using PagedAttention and continuous batching.
VRAM — The dedicated memory on your GPU; a model's weights plus its KV cache must fit in VRAM to run fast, making it the single biggest hardware limit for local AI.
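The memory arithmetic behind several of the entries above (parameters, quantization, KV cache, VRAM) can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not an exact sizing tool: the 4.5 bits per weight used for a 4-bit quant is an assumed average, and the layer count, KV head count, and head dimension in the KV cache example are hypothetical values chosen for illustration, not the specs of any particular model.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes).

    Billions of parameters times bits per weight, divided by 8 bits per byte.
    """
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for one request.

    The leading 2 counts keys and values; bytes_per_value=2 assumes the
    cache is kept at 16-bit precision (an assumption, not a given).
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# ASSUMPTION: a "4-bit" quant averages about 4.5 bits per weight in practice.
ASSUMED_4BIT_BITS_PER_WEIGHT = 4.5

print(weight_gb(7, 16))                             # 14.0   (7B at 16-bit)
print(weight_gb(70, ASSUMED_4BIT_BITS_PER_WEIGHT))  # 39.375 (70B at 4-bit)

# HYPOTHETICAL model shape, for illustration only.
cache = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                    context_tokens=8192)
print(round(cache, 2))                              # 1.07
```

The weights figure scales with the parameter count and the quant, while the KV cache figure scales with context length, which is why a model that loads fine can still run out of VRAM on a long prompt.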

Common questions

Can I run Llama 3 on a MacBook? — Yes — Apple Silicon MacBooks are genuinely good at local AI because the GPU shares the machine's unified memory, so your total RAM is effectively your VRAM budget; what size you can run depends on how much RAM you have.
Do I need a GPU to run local AI? — No — small models run on a CPU (just slowly), and Apple Silicon Macs run models well using shared memory instead of a separate GPU; but a GPU with enough VRAM is what makes larger models run at a comfortable speed.
GGUF vs safetensors — which should I download? — Download GGUF if you're running a model locally on ordinary hardware (it's a single pre-quantized file that friendly tools use); download safetensors if you're fine-tuning or serving at scale on GPUs — that's the full-precision original.
H100 vs A100 for inference — The H100 is the newer, faster card with more memory bandwidth; the A100 is older, cheaper to rent, and often plenty for many inference jobs — the right choice is the cheapest one that has enough VRAM and throughput for your specific workload.
How big is a 7B / 70B model download? — It depends on the quant: a 7B is about 4-5 GB at 4-bit (about 14 GB at 16-bit), and a 70B is about 40 GB at 4-bit (140 GB or more at 16-bit). The download size closely tracks the memory the model needs to run.
How do I choose which AI model to run? — Filter by objective facts first — what fits your hardware, the license for your use, and the task type — then test the top candidates on your own work, because quality is best judged on your task, not on someone else's benchmark.
How do I pick a model for coding? — Start with a recognized coding-tuned model in a size your hardware can run, prefer a permissive license if it's for work, and test it on your own real code — coding quality is best judged on your stack, not on someone else's benchmark.
How do I run AI privately and offline? — Run an open model locally with a tool like Ollama or LM Studio — the model lives on your machine, so nothing you type or generate ever leaves it, and it works with no internet connection at all.
How do I run my first local AI model? — Install a friendly runner like Ollama or LM Studio, pick a small-to-mid model that fits your hardware, and run one command (or click download) — you'll be chatting with a fully local model in minutes, for free.
How many tokens per second is usable? — A rough guide: below ~5 tokens/sec feels sluggish, ~10-20 is comfortable for chat (around or above reading speed), and higher is nice but has diminishing returns for a single reader — what you actually get depends on the model, quant, and hardware.
How much does it cost to run an AI model? — It depends on how you run it: locally the compute is effectively $0 (just electricity, after you own the hardware), a hosted API charges per token, and a rented GPU charges per hour — the cheapest option changes with your usage.
How much RAM vs VRAM do I need for LLMs? — VRAM (GPU memory) is what a model needs to run fast, and it's usually the binding limit; system RAM matters mainly for CPU-only running and as slower overflow — except on Apple Silicon Macs, where unified memory means RAM is your VRAM.
How much VRAM does a 70B model need? — Roughly 40 GB at the common 4-bit quant (about 140 GB at full 16-bit), plus several GB of headroom for the KV cache — so a 70B is a two-24GB-card, workstation, or big-Mac job, not a single-consumer-GPU one.
Is a used RTX 3090 good for local LLMs in 2026? — Yes — the used RTX 3090's 24 GB of VRAM makes it one of the best value cards for local LLMs, since VRAM (not raw speed) decides what you can run, and 24 GB comfortably fits strong 32B-class models at 4-bit.
Is fine-tuning worth it, or should I just prompt? — Usually start with prompting and retrieval (RAG) — they're cheaper, faster to iterate, and handle most needs; fine-tune only when you need consistent style, format, or task behavior that prompting can't reliably deliver, and use RAG (not fine-tuning) for facts.
Is renting an H100 worth it? — It's worth it when your workload is heavy and steady enough to keep the GPU busy — an H100 rented by the hour is only cheap per token if you actually use most of those hours; for light or bursty use, a pay-per-token API is almost always cheaper.
Is running AI locally cheaper than ChatGPT? — It can be, but 'local' isn't free — you pay for hardware and electricity instead of a subscription or per-token fees; for light use a hosted service is often cheaper, while heavy use, privacy, and offline access favor local.
What GPU should I buy for running local LLMs? — Buy for VRAM first — it decides what you can run — so a 24 GB card (like a used RTX 3090 or a 4090) is the sweet spot; 16 GB is a strong mid-range choice, and 8-12 GB is a fine entry point for smaller models.
What LLMs can I run on 12GB of VRAM? — 12 GB comfortably runs 7-8B models with plenty of headroom (so you can use higher quants or longer contexts) and reaches into the low-teens-billion range at 4-bit — a nice step up from 8 GB.
What LLMs can I run on 16GB of VRAM? — 16 GB comfortably runs models up into the mid-teens-billion range at 4-bit with room for context, and lets you run 7-13B models at high quants — a strong, well-balanced tier for local AI.
What LLMs can I run on 24GB of VRAM? — 24 GB is the local sweet spot — it comfortably fits 32B-class models at 4-bit with room for context, and runs smaller models at very high quants or long contexts; only 70B-and-up models are out of reach.
What LLMs can I run on 8GB of VRAM? — On 8 GB you can comfortably run small-to-7B models at 4-bit (with headroom for context), which covers a lot of genuinely useful chat, coding, and writing models — 8 GB is a solid entry point for local AI.
What quantization should I use? — Use the highest quant your VRAM comfortably fits with headroom for context — for most people that's Q4_K_M, stepping up to Q5_K_M or Q6_K if you have room, and only dropping to Q3 or below to make a model fit at all.
What's the cheapest way to run a 70B model? — For occasional use, a pay-per-token API with your own key is usually cheapest; for heavy steady use, a rented GPU serving it yourself wins; running locally is only 'free' if you already own a 48GB-class rig or a big Mac.
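The cost answers above (pay-per-token API for light use, a rented GPU for heavy steady use) come down to a break-even calculation. The sketch below shows the shape of that comparison. Both prices are made-up placeholders, not real quotes from any provider, and the function names are hypothetical.

```python
def api_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of processing `tokens` through a pay-per-token API."""
    return tokens / 1_000_000 * usd_per_million_tokens


def gpu_cost(hours: float, usd_per_hour: float) -> float:
    """Cost of renting a GPU for `hours`, whether it is busy or idle."""
    return hours * usd_per_hour


def break_even_tokens_per_hour(usd_per_hour: float,
                               usd_per_million_tokens: float) -> float:
    """Tokens per hour at which the rented GPU costs the same as the API."""
    return usd_per_hour / usd_per_million_tokens * 1_000_000


# PLACEHOLDER prices for illustration only; substitute current real ones.
GPU_USD_PER_HOUR = 2.00
API_USD_PER_MILLION_TOKENS = 0.50

print(break_even_tokens_per_hour(GPU_USD_PER_HOUR,
                                 API_USD_PER_MILLION_TOKENS))  # 4000000.0

# Light hour (200k tokens): the API is cheaper.
print(api_cost(200_000, API_USD_PER_MILLION_TOKENS), gpu_cost(1, GPU_USD_PER_HOUR))

# Heavy hour (10M tokens): the rented GPU is cheaper.
print(api_cost(10_000_000, API_USD_PER_MILLION_TOKENS), gpu_cost(1, GPU_USD_PER_HOUR))
```

The comparison only holds if the rented GPU can actually sustain the break-even throughput for your model, and if you keep it busy for most of the hours you pay for.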
