llama.cpp

The open-source C/C++ inference engine that runs GGUF models efficiently on CPUs and consumer GPUs; it's the foundation under Ollama, LM Studio, and many local apps.

llama.cpp is a lightweight, dependency-minimal engine for running LLMs locally. It pioneered the GGUF format and the K-quant quantization levels (Q4_K_M and friends), and it's engineered to get good performance on ordinary hardware — CPU-only, GPU, or a split between the two.

Because it's fast, portable, and free, it became the de facto base layer for the local-AI ecosystem: tools like Ollama and LM Studio use llama.cpp internally and add friendlier interfaces on top. You can also use it directly via its command line or its built-in server.

If you're running quantized GGUF models on consumer hardware, you're almost certainly running llama.cpp somewhere in the stack. It's a key reason "run it locally for $0 in compute" is a real option.

Related

GGUF · Q4_K_M and quant levels · Ollama · LM Studio

All explainers → · Browse models →

Open the free Spanvero advisor → · Honest, $0-markup. © 2026 Cynosure LLC.