
LoRA

A cheap fine-tuning method that freezes the base model and trains tiny add-on "adapter" matrices, producing a small file you can stack on top of the original weights.

LoRA, short for Low-Rank Adaptation, is the technique that made customizing open models affordable for ordinary people. It's worth understanding because it changes what fine-tuning costs from "needs a data center" to, for many models, "fits on a single consumer GPU." It's one of the most widely used ways to adapt an open model to your own data without the expense of retraining the whole thing.

The core idea is elegant. Full fine-tuning updates all of a model's billions of parameters, which requires huge amounts of GPU memory (you need room for the weights plus optimizer state and gradients for every one of them) and a lot of compute. LoRA sidesteps this. It freezes the original base model entirely — those billions of weights never change — and instead inserts small pairs of low-rank matrices (the "adapter") into the model's layers, and trains only those. Because the adapter has a tiny fraction of the parameters, training uses far less memory and time, and the result you save is a small adapter file, often just a few megabytes to a few hundred megabytes, rather than a full multi-gigabyte model.
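To make the frozen-base-plus-adapter idea concrete, here is a minimal sketch of a single adapted layer in plain Python. The names `lora_forward`, `down`, and `up`, the tiny 64x64 layer, and the zero initialization of `up` are illustrative assumptions, not any particular library's API; real implementations use tensor libraries and train the adapter with gradients.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, down, up, alpha):
    """Frozen path x @ W plus the scaled low-rank path (x @ down) @ up."""
    rank = len(up)                        # inner dimension of the adapter pair
    scale = alpha / rank                  # common alpha / rank scaling (assumed)
    base = matmul(x, W)                   # frozen weights, never updated
    delta = matmul(matmul(x, down), up)   # trainable adapter path
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

def count(m):
    """Number of entries in a matrix."""
    return len(m) * len(m[0])

random.seed(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8   # hypothetical toy sizes
W = [[random.gauss(0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
down = [[random.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_in)]
up = [[0.0] * d_out for _ in range(rank)]  # zero init: adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)]]

frozen_params = count(W)                     # 64 * 64
trainable_params = count(down) + count(up)   # 64 * 4 + 4 * 64
y = lora_forward(x, W, down, up, alpha)
print(frozen_params, trainable_params)
```

Even in this toy layer the adapter has 512 trainable values against 4,096 frozen ones, and the gap widens quickly as the layer gets larger while the rank stays small.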

This smallness unlocks a nice workflow. Since the adapter is separate from the base, you can keep one copy of the base model and swap different LoRAs on top of it: one adapter that gives it a specific coding style, another for a customer-service persona, another tuned on your company's documents. You load the base once and apply whichever adapter you need. LoRAs can also be merged back into the base weights permanently if you'd rather ship a single standalone model.
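The swap-or-merge choice can be sketched with a toy 2x2 base. Everything here (the adapter names, the dictionary layout, the helpers `apply_adapter` and `merge_adapter`) is hypothetical and only meant to show that merging folds the same update into the weights that the separate adapter applies at run time.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(m, delta, scale):
    """Return m + scale * delta as a new matrix (m is left untouched)."""
    return [[a + scale * d for a, d in zip(m_row, d_row)]
            for m_row, d_row in zip(m, delta)]

def apply_adapter(x, W, adapter):
    """Run x through the frozen base plus one adapter kept as a separate add-on."""
    scale = adapter["alpha"] / len(adapter["up"])
    low_rank = matmul(matmul(x, adapter["down"]), adapter["up"])
    return add_scaled(matmul(x, W), low_rank, scale)

def merge_adapter(W, adapter):
    """Fold the adapter into a copy of the base: W + scale * (down @ up)."""
    scale = adapter["alpha"] / len(adapter["up"])
    return add_scaled(W, matmul(adapter["down"], adapter["up"]), scale)

W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weights, loaded once
adapters = {                  # hypothetical rank-1 adapters
    "coding_style": {"down": [[1.0], [0.0]], "up": [[0.0, 0.5]], "alpha": 2.0},
    "support_persona": {"down": [[0.0], [1.0]], "up": [[0.25, 0.0]], "alpha": 1.0},
}
x = [[2.0, 3.0]]

# Swap adapters on the same base without touching W.
swapped = {name: apply_adapter(x, W, a) for name, a in adapters.items()}

# Or merge one adapter permanently into a standalone weight matrix.
merged_W = merge_adapter(W, adapters["coding_style"])
merged_out = matmul(x, merged_W)
print(swapped, merged_out)
```

The merged weights produce the same output as the base with the adapter applied on the fly, which is why merging is a packaging decision rather than a change in behavior.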

A closely related variant, QLoRA, pushes accessibility even further. QLoRA fine-tunes a LoRA adapter on top of a base model that has been quantized to 4-bit, dramatically cutting the memory needed for the base during training. The combination means you can fine-tune surprisingly large models on a single consumer GPU — something that was out of reach for individuals just a few years ago. (For the quantization background, see the quantization explainer.)
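A rough back-of-the-envelope calculation shows why 4-bit quantization matters. The 7-billion-parameter model is a hypothetical example, and the estimate covers the base weights only: it ignores quantization metadata, activations, the adapter itself, and optimizer state, so treat it as a lower bound rather than a requirement.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 7e9  # hypothetical 7-billion-parameter base model

fp16_gb = weight_memory_gb(params, 16)  # 16-bit base weights
int4_gb = weight_memory_gb(params, 4)   # 4-bit quantized base weights

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions the base weights shrink from about 14 GB to about 3.5 GB, which is the difference between exceeding and fitting within the memory of many consumer GPUs.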

LoRA isn't only a text-model tool. It's just as common in the image-generation world, where small LoRA files add a specific art style, character, or subject to a diffusion model without retraining it — you'll find enormous libraries of these community-made adapters for Stable Diffusion and similar models. The concept is identical: a small, stackable modification on top of a frozen base.

Where does LoRA fit in the bigger picture? It's a form of parameter-efficient fine-tuning, and fine-tuning in general is the right tool for teaching a model a style, format, or task behavior. It is not the right tool for injecting fresh facts, where retrieval (feeding documents into the context, often via embeddings) usually works better and costs less. The base you start from also matters: you typically apply a LoRA to a base or instruct model depending on your goal, and the instruct models most people use are themselves the product of fine-tuning a base model.

A couple of practical parameters govern how a LoRA behaves. The most important is the rank, the "low-rank" inner dimension of the adapter matrices. A higher rank gives the adapter more capacity to learn (better for bigger behavior changes) at the cost of a larger file and more training memory; a lower rank is lighter and often plenty for style or format tweaks. There's also a scaling factor (often called alpha) that controls how strongly the adapter influences the base model; in the common convention the adapter's contribution is multiplied by alpha divided by the rank. You don't need to master these to use existing LoRAs, but they explain why two adapters for the same model can differ in size and effect.
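The effect of rank on file size is simple arithmetic, sketched below. The model shape (64 adapted 4096x4096 matrices), the 16-bit storage, and the alpha value are made-up assumptions for illustration, and the alpha / rank scaling is a common convention rather than a universal rule.

```python
def adapter_params(d_in, d_out, rank):
    """Parameters in one adapter pair: (d_in x rank) plus (rank x d_out)."""
    return rank * (d_in + d_out)

def adapter_size_mb(d_in, d_out, rank, num_matrices, bytes_per_param=2):
    """Approximate file size in MB (1 MB = 1e6 bytes), assuming 16-bit storage."""
    return adapter_params(d_in, d_out, rank) * num_matrices * bytes_per_param / 1e6

# Hypothetical model shape: 64 adapted weight matrices of size 4096 x 4096.
D_IN, D_OUT, NUM_MATRICES, ALPHA = 4096, 4096, 64, 16

report = {}
for rank in (8, 64):
    report[rank] = {
        "size_mb": adapter_size_mb(D_IN, D_OUT, rank, NUM_MATRICES),
        "scale": ALPHA / rank,  # how strongly the adapter's output is weighted
    }
    print(rank, report[rank])
```

With these numbers the rank-8 adapter comes to roughly 8 MB and the rank-64 adapter to roughly 67 MB. Holding alpha fixed while raising the rank also lowers the per-adapter scale, which is one reason the two settings are usually tuned together.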

It's also worth being clear about LoRA's limits. Because it only trains a small set of add-on parameters on top of a frozen base, a LoRA is excellent for nudging behavior, style, format, and narrow-domain adaptation. It is a poor fit for teaching a model fundamentally new capabilities the base lacks, and it's not the right tool for injecting large bodies of fresh factual knowledge (retrieval with embeddings usually works better and costs less for that, as covered in the fine-tuning explainer). Think of a LoRA as a steering adjustment on an existing model, not a way to build a new one.

Because LoRA adapters are usually distributed as small safetensors files that layer onto a base you can run yourself, they fit naturally into the self-hosting story Spanvero exists to make transparent. If you're planning to fine-tune with LoRA or QLoRA and then serve the result, you can compare the honest cost of running the base model locally, on a rented GPU, or via your own API key at /calculator/, and browse suitable base models to adapt under /models/.

