Spanvero How it works Find a model Compare models Pricing

Ollama

A simple, one-command tool for downloading and running open models locally; it wraps llama.cpp and serves a local API, prioritizing ease of use above all.

Ollama is the easiest on-ramp to running AI models on your own machine, and for many people it's the first tool that makes local AI actually click. The whole pitch is removing friction: you install it, run a single command like "ollama run llama3," and it downloads a ready-to-go quantized model and drops you straight into a chat — no hunting for the right file, no manual configuration, no wrangling formats. It runs on macOS, Windows, and Linux.

Under the hood, Ollama builds on llama.cpp for the actual inference and adds the conveniences that make it approachable: a curated model library you pull from by name, automatic downloads of the appropriate quantized GGUF, and sensible defaults so things just work. It picks a reasonable quant for your hardware, applies the correct chat template for each model automatically (which matters — see the base vs instruct explainer for why the template matters), and uses your GPU if you have one while falling back to CPU if you don't. You can also customize models with a simple "Modelfile" to set a system prompt or parameters.

A feature that makes Ollama more than a toy is its local HTTP API. When Ollama is running, it exposes a server on your machine that other applications can talk to, including an OpenAI-compatible endpoint. This means you can point existing apps, scripts, or coding tools at your local Ollama instance instead of a paid cloud service — your code calls localhost, the model runs on your own hardware, and no data leaves your machine. For developers and privacy-conscious users, this is the real unlock: local models as a drop-in, zero-per-token backend for whatever you're building.

Where Ollama sits in the landscape is worth being honest about. It optimizes for convenience and single-user simplicity, not maximum throughput. For one person on a laptop or desktop — chatting, coding, prototyping, keeping data private — it's close to ideal. For serving a model to many concurrent users or an app with real traffic, you'd reach for a dedicated serving engine like vLLM instead, which is built for high-throughput batching on GPUs. Ollama and LM Studio occupy the same friendly, local, single-user niche; the main difference is that LM Studio is a graphical desktop app while Ollama is command-line-first (though several GUIs exist that connect to Ollama). All three of these — Ollama, LM Studio, and llama.cpp directly — run the same underlying GGUF models.

Getting started is genuinely a two-step affair: install Ollama, then run "ollama pull <model>" to download a model or "ollama run <model>" to download and immediately start chatting. You can list what you've downloaded, remove models to reclaim disk space, and choose among a model's available sizes and quants by tag (for example a ":7b" or ":q4" suffix). Because each pulled model can be several gigabytes, keeping an eye on disk usage is worthwhile, especially if you collect a few to compare. The models themselves are the same GGUF files the wider ecosystem uses, so anything you learn about quant levels and VRAM applies directly.

A point worth emphasizing for developers: the local API is what turns Ollama from a chat toy into infrastructure. Because it exposes an OpenAI-compatible endpoint on localhost, a huge amount of existing tooling — coding assistants, chat frontends, automation scripts, agent frameworks — can be pointed at your local Ollama with little more than changing a base URL. That means you can prototype and even run real applications against a local model at zero per-token cost, then decide later whether to move to a hosted API or a rented GPU as your needs grow. This makes Ollama a natural starting point even for projects that might eventually scale beyond it.

Because it runs models on hardware you already own, Ollama is the archetype of the "local, $0 compute" option: your only real cost is electricity, and everything stays offline and private. That's one of the three ways to run any model, and often the cheapest for light or personal use (see local vs API vs renting a GPU for the full comparison). The catch is that you're limited by your own VRAM — big models may not fit or may run slowly — so it's worth checking what your hardware can handle first. Spanvero recommends Ollama (among other local runners) whenever your own machine is the cheapest way to run a given model, and shows the honest VRAM you'd need. Browse models sized for your card at /models/8gb-vram/ or /models/16gb-vram/, and use the calculator at /calculator/ to compare running locally in Ollama against renting a GPU or using your own API key — with $0 markup either way.

llama.cpp · LM Studio · GGUF · Local vs API vs renting a GPU · vLLM · Base vs instruct model

All explainers → · Browse models →

The weekly price index

A short email of real AI price moves, straight from the daily log — no hype. We're collecting the list now; the first issue goes out when it opens. Unsubscribe with one click.

Joining the list needs JavaScript — or just email support@spanvero.com and we'll add you.

Ollama

Related

The weekly price index