Plain-English answers to the terms you hit when running models yourself — quantization, GGUF, VRAM, context windows and more.
Active vs total parameters — In an MoE model, total parameters is everything that must be loaded into memory, while active parameters is the smaller subset actually used per token — the former drives memory, the latter drives speed/cost.
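To make the split concrete, here is a minimal arithmetic sketch. The function name and the 40B/10B model are hypothetical, chosen only for illustration; 2 bytes per weight assumes 16-bit storage.

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_weight=2.0):
    """Return (weight memory in GB, fraction of weights used per token)."""
    # Every parameter must be resident in memory, so memory follows the total.
    memory_gb = total_params_b * 1e9 * bytes_per_weight / 1e9
    # Only the active subset does work per token, so compute follows this fraction.
    active_fraction = active_params_b / total_params_b
    return memory_gb, active_fraction

# Hypothetical model: 40B total parameters, 10B active per token, 16-bit weights.
memory_gb, active_fraction = moe_footprint(40, 10)
print(f"weights: {memory_gb:.0f} GB, active per token: {active_fraction:.0%}")
```

In this made-up case the model needs the memory of a 40B model while doing roughly the per-token compute of a 10B one.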
Base vs instruct model — A base model is the raw pretrained text-completion model; an instruct (or chat) model has been further tuned to follow instructions and hold a conversation.
Context window — The maximum number of tokens (prompt plus generated output) a model can consider at once; anything beyond it is cut off or forgotten.
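One way to picture the limit is a sketch that trims a prompt so the prompt plus the requested output still fits. The helper name and the keep-the-newest policy are assumptions for illustration; real runtimes differ (some drop the oldest tokens, others reject the request).

```python
def fit_to_context(prompt_tokens, context_window, max_new_tokens):
    """Keep the most recent prompt tokens that leave room for the output."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Negative slicing keeps the tail; shorter prompts are returned unchanged.
    return prompt_tokens[-budget:]

tokens = list(range(10))  # stand-in for 10 token IDs
kept = fit_to_context(tokens, context_window=8, max_new_tokens=3)
print(kept)  # the 5 newest tokens; the 5 oldest are dropped
```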
Diffusion model — The dominant architecture for AI image (and video) generation: it learns to turn random noise into a coherent image by removing noise step by step, guided by your prompt.
Embeddings — Numeric vectors that represent the meaning of text (or images), so that similar content sits close together — the backbone of semantic search and RAG.
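"Close together" is usually measured with cosine similarity. The sketch below ranks a few items against a query; the three-dimensional vectors are made up for illustration (real embeddings come from a model and have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings.
docs = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
query = docs["cat"]
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)  # most similar first
```

Semantic search works the same way at scale: embed the query, then return the stored items whose vectors score highest.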
Fine-tuning — Continuing to train an existing model on your own data so it adapts to a specific task, domain, style, or format.
GGUF — A single-file model format used by llama.cpp, Ollama, and LM Studio that bundles the (usually quantized) weights plus all metadata, optimized for local CPU/GPU inference.
Inference — Actually running a trained model to get outputs — generating text, an image, or a transcription — as opposed to training it.
KV cache — A per-request memory buffer that stores the attention keys and values for tokens already processed, so the model doesn't recompute them on every new token — it speeds up generation but consumes VRAM that grows with context length.
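The growth with context length can be estimated with simple arithmetic. This sketch assumes a standard attention layout where each layer stores one key tensor and one value tensor per token; the model shape (32 layers, 8 KV heads, head size 128) is hypothetical, and 2 bytes per value assumes a 16-bit cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Approximate KV cache size for one request."""
    # The leading 2 counts one tensor for keys plus one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Hypothetical model shape, evaluated at two context lengths.
for n_tokens in (4096, 32768):
    gib = kv_cache_bytes(32, 8, 128, n_tokens) / 2**30
    print(f"{n_tokens} tokens -> {gib:.2f} GiB")
```

The size is linear in the number of tokens, so an 8x longer context needs 8x the cache.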
llama.cpp — The open-source C/C++ inference engine that runs GGUF models efficiently on CPUs and consumer GPUs; it's the foundation under Ollama, LM Studio, and many local apps.
LM Studio — A free desktop app with a graphical interface for discovering, downloading, and chatting with local models, and serving them via a local API — a GUI-first alternative to command-line tools.
Local vs API vs renting a GPU — The three ways to actually run an open model: on your own hardware (local, with no per-token fees once you own the hardware), through a hosted pay-per-token API, or by renting a cloud GPU and serving it yourself.
LoRA — A cheap fine-tuning method that freezes the base model and trains tiny add-on "adapter" matrices, producing a small file you can stack on top of the original weights.
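A quick count shows why the adapter file is small. The sketch assumes the usual low-rank setup, where a frozen d_out x d_in weight matrix gets two trainable matrices of shapes d_out x rank and rank x d_in; the 4096 x 4096 layer and rank 8 are illustrative values, not taken from any particular model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (frozen weights in the layer, trainable adapter weights)."""
    full = d_in * d_out
    # Two thin matrices: (d_out x rank) and (rank x d_in).
    adapter = rank * (d_in + d_out)
    return full, adapter

full, adapter = lora_param_counts(4096, 4096, 8)
print(f"frozen: {full:,}  trainable: {adapter:,}  ratio: {adapter / full:.2%}")
```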
Mixture of Experts (MoE) — An architecture that contains many "expert" sub-networks but routes each token through only a few of them, so a model can have huge total size while doing the compute of a much smaller one.
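The routing step can be sketched in a few lines. This is one common variant (pick the top-k router scores, then softmax over just those); actual models differ in details such as when the softmax is applied. The logits below are made up.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax their scores into weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Made-up router scores for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.9]
chosen = route_top_k(logits, k=2)
print(chosen)  # (expert index, mixing weight) pairs for the 2 selected experts
```

Only the selected experts run for this token; the other six are skipped, which is where the compute saving comes from.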
Ollama — A simple, one-command tool for downloading and running open models locally; it wraps llama.cpp and serves a local API, prioritizing ease of use.
Parameters (the "B" / billions) — A model's learned numerical weights; the "B" in a name like "7B" or "70B" means billions of them. Parameter count is the single biggest driver of a model's memory footprint and a rough indicator of its capability.
Q4_K_M and quant levels — A naming scheme for llama.cpp/GGUF quantized models — Q4_K_M means 4-bit, "K-quant" method, Medium size — and is the most commonly recommended balance of size and quality.
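As an illustration of how the names decompose, here is a small parser. It is a sketch that only handles names shaped like Q4_K_M, Q6_K, or Q8_0; other quant names exist and are not covered, and the helper and its output fields are invented for this example.

```python
import re

# Q<bits>_K with an optional _S/_M/_L suffix, or the older Q<bits>_0 / Q<bits>_1 style.
PATTERN = re.compile(r"^Q(?P<bits>\d+)_(?:(?P<k>K)(?:_(?P<size>[SML]))?|[01])$")
SIZES = {"S": "small", "M": "medium", "L": "large"}

def parse_quant(name):
    """Split a quant name into bits, K-quant flag, and size variant (or None)."""
    m = PATTERN.match(name)
    if m is None:
        return None  # a name this sketch does not understand
    return {
        "bits": int(m["bits"]),
        "k_quant": m["k"] is not None,
        "size": SIZES.get(m["size"]),
    }

for name in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(name, parse_quant(name))
```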
Quantization — Storing a model's weights at lower numeric precision (e.g. 4-bit instead of 16-bit) to shrink its memory footprint and speed it up, at the cost of some accuracy.
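The accuracy trade-off is easiest to see in a round trip. The sketch below uses the simplest possible scheme, a single scale for the whole list with symmetric rounding; real GGUF quants use more elaborate block-wise schemes, so treat this as an illustration of the idea rather than how llama.cpp does it. The weights are made up.

```python
def quantize(weights, bits=4):
    """Map floats to small signed integers sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    # Fall back to 1.0 so an all-zero input does not divide by zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.70, -0.32, 0.12, 0.01]  # made-up values
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
print(codes)
print([round(r, 3) for r in restored])
```

Each weight now takes 4 bits instead of 16, and the restored values are close to, but not exactly, the originals.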
Safetensors — Hugging Face's standard, safe-to-load weight format that stores raw model tensors without executable code, used as the canonical full-precision distribution format for open models.
Text-to-image — Generating an image from a written description (a prompt); today this is almost always done with diffusion models.
Tokens — The chunks of text (roughly word-pieces) that a model reads and writes; pricing, speed, and context limits are all measured in tokens, not words.
TTS / ASR (text-to-speech & speech recognition) — TTS (text-to-speech) turns written text into spoken audio; ASR (automatic speech recognition) does the reverse, transcribing speech into text.
vLLM — A high-throughput, GPU-focused serving engine for LLMs, designed to serve many concurrent requests efficiently using PagedAttention and continuous batching.
VRAM — The dedicated memory on your GPU; a model's weights (plus its KV cache) must fit in VRAM to run fast, since anything that spills over to system RAM slows generation sharply. This makes VRAM the main hardware limit for local AI.
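A back-of-the-envelope fit check follows from that definition. This is a rough sketch: the function, the 1 GB overhead allowance, and the 7B / 8 GB example are assumptions for illustration, and bits_per_weight is nominal (real quantized files carry some extra bits for scales).

```python
def fits_in_vram(params_billions, bits_per_weight, kv_cache_gb, vram_gb, overhead_gb=1.0):
    """Return (estimated GB needed, whether it fits in the given VRAM)."""
    # 1e9 parameters at (bits / 8) bytes each is (params_billions * bits / 8) GB.
    weights_gb = params_billions * bits_per_weight / 8
    needed_gb = weights_gb + kv_cache_gb + overhead_gb
    return needed_gb, needed_gb <= vram_gb

# Hypothetical: a 7B model on an 8 GB GPU with 1 GB reserved for the KV cache.
for bits in (16, 4):
    needed, ok = fits_in_vram(7, bits, kv_cache_gb=1.0, vram_gb=8)
    print(f"{bits}-bit: needs about {needed:.1f} GB, fits: {ok}")
```

Under these assumptions the 16-bit weights do not fit while the 4-bit ones do, which is the usual reason quantized models are used on consumer GPUs.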