
Diffusion model

The dominant architecture for AI image (and video) generation: it learns to turn random noise into a coherent image by removing noise step by step, guided by your prompt.

Diffusion models are the technology behind nearly all modern open-source image generation, and once you understand the basic idea, the strange-sounding settings (steps, guidance scale, seeds) start to make sense. A diffusion model generates images by running, quite literally, the reverse of the process it was trained on. It's a fundamentally different kind of model from the text LLMs most people first encounter, with different hardware needs and a different way of thinking about cost.

Here's the intuition. During training, a diffusion model is shown real images that have had noise progressively added to them, step by step, until they're pure static — and it learns to predict and reverse that noise, to "denoise" one step at a time. To generate a brand-new image, you flip the process: start from a canvas of pure random noise and let the model denoise it over many steps, gradually pulling a coherent image out of the chaos. Crucially, this denoising is steered by a text prompt (using a text encoder that turns your words into guidance), so the image that emerges matches your description. Stable Diffusion, FLUX, and the other open image models all work on this denoising principle.
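
The noising and denoising described above can be sketched on a toy "image" of four numbers. This is a minimal illustration, not any particular model's implementation: the linear noise schedule and its constants (1000 steps, betas from 0.0001 to 0.02) are assumptions borrowed from common practice, the function names are made up for this example, and the true noise stands in for what a trained network would predict.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal's variance that survives after each step.

    Assumes a linear beta schedule; real models may use other schedules.
    """
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Training-time direction: blend the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    xt = [signal * x + noise * e for x, e in zip(x0, eps)]
    return xt, eps


def recover_clean(xt, eps_pred, alpha_bar):
    """Generation-time direction: undo the blend given a noise prediction.

    A trained network would supply eps_pred; here the caller passes it in.
    """
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - noise * e) / signal for x, e in zip(xt, eps_pred)]


rng = random.Random(0)
clean = [0.9, -0.2, 0.5, 0.1]            # stand-in for pixel values
alpha_bars = alpha_bar_schedule(1000)

# Halfway through the schedule the signal is already mostly buried in noise.
noisy, true_noise = add_noise(clean, alpha_bars[499], rng)

# With a perfect noise prediction, the clean signal comes back.
restored = recover_clean(noisy, true_noise, alpha_bars[499])
```

A real model's predictions are imperfect, which is why generation proceeds in many small denoising steps rather than one jump like the last line here.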

Several practical knobs fall directly out of how diffusion works. The number of denoising steps trades speed for quality: more steps generally give a cleaner, more refined image but take roughly proportionally longer, and each model has a sweet spot beyond which extra steps add little. The guidance scale (often "CFG", short for classifier-free guidance) controls how strictly the model sticks to your prompt versus generating something more freely: too low and it drifts away from your words, too high and images can look over-cooked. A random seed sets the initial noise, so fixing the seed makes a generation reproducible (the same model, prompt, settings, and seed give the same image, at least on the same hardware and software), which is invaluable for iterating. Negative prompts let you specify things to avoid.
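
Two of these knobs are simple enough to sketch directly. The snippet below is a toy illustration with made-up function names: it shows how a seed pins down the starting noise, and how classifier-free guidance is commonly computed by pushing the prompt-conditioned noise prediction away from the unconditioned one. The short lists stand in for the full-size tensors a real model would produce.

```python
import random


def initial_noise(seed, size):
    """The seed fully determines the starting noise for a generation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start from the prediction made without the
    prompt and move toward (or past) the prediction made with it."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Same seed, same starting canvas; a different seed gives a different one.
same_a = initial_noise(42, 4)
same_b = initial_noise(42, 4)
different = initial_noise(43, 4)

# Toy predictions: a scale of 1.0 returns the prompt-conditioned prediction
# unchanged, while larger scales exaggerate the prompt's influence.
plain = guided_noise([0.0, 0.5], [1.0, 0.5], 1.0)    # [1.0, 0.5]
strong = guided_noise([0.0, 0.5], [1.0, 0.5], 7.5)   # [7.5, 0.5]
```

In many implementations a negative prompt works by taking the place of the empty prompt in the unconditioned branch, so guidance pushes the image away from it. Treat that as a common convention rather than a rule every tool follows.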

Most modern systems are latent diffusion models, which rely on an important efficiency trick: instead of denoising in the full high-resolution pixel space (which would be very slow and memory-heavy), they run the diffusion process in a compressed "latent" space and only decode to full pixels at the end. This is a big part of why generating an image on consumer hardware is feasible at all. Diffusion models are also very often customized with LoRA adapters, small add-on files that inject a specific art style, character, or subject without retraining the whole model (the same LoRA concept used for text models). This has produced huge community libraries of styles you can stack onto a base model.
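
Some rough arithmetic shows why both ideas save so much. The figures below are assumptions chosen for illustration: an 8x spatial downsample with 4 latent channels (typical of the original Stable Diffusion autoencoder; other models differ), and a 1024 x 1024 weight matrix adapted at LoRA rank 16.

```python
def tensor_elements(height, width, channels):
    """Number of values the model has to process for one image."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the compressed latent; the defaults are assumed values."""
    return height // downsample, width // downsample, latent_channels


def lora_params(d_in, d_out, rank):
    """A LoRA adapter replaces a full d_in x d_out update with two thin
    matrices of shapes (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


# Latent diffusion: denoise a small latent instead of every pixel.
pixel_count = tensor_elements(512, 512, 3)          # 786432
latent_h, latent_w, latent_c = latent_shape(512, 512)
latent_count = tensor_elements(latent_h, latent_w, latent_c)   # 16384
compression = pixel_count / latent_count            # 48.0

# LoRA: train and ship far fewer numbers than the layer it modifies.
full_layer = 1024 * 1024                            # 1048576
adapter = lora_params(1024, 1024, rank=16)          # 32768
shrink = full_layer / adapter                       # 32.0
```

Under these assumptions the denoiser works on about 48 times fewer values per step, and the adapter for that one layer is about 32 times smaller than the layer itself.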

Diffusion is not only for images. The same denoising principle now underpins most open text-to-video and image-to-video models, which are far heavier than image models: video adds the dimension of time, so a model has to keep many frames coherent, and the VRAM and time per generation climb steeply. This is why video generation is squarely rented-GPU or high-end-hardware territory, whereas many image models run comfortably on a mid-range card. The same settings (steps, guidance, seed) carry over, with extra knobs for frame count and clip length.
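
A back-of-the-envelope sketch of why video climbs so steeply. It rests on simplifying assumptions that real video models relax in various ways: one token per latent position, an 8x spatial downsample, no compression along the time axis, and attention computed between every pair of tokens.

```python
def latent_tokens(frames, height, width, downsample=8):
    """Latent positions the model must keep coherent (assumed layout)."""
    return frames * (height // downsample) * (width // downsample)


def attention_pairs(tokens):
    """Full self-attention compares every token with every other token."""
    return tokens * tokens


image_tokens = latent_tokens(1, 512, 512)     # 4096
video_tokens = latent_tokens(48, 512, 512)    # 196608

data_growth = video_tokens / image_tokens                                      # 48.0
attention_growth = attention_pairs(video_tokens) / attention_pairs(image_tokens)  # 2304.0
```

Under these assumptions a 48-frame clip holds 48 times the data of a single image, but the pairwise attention work grows by 48 squared. Actual models cut this down with temporal compression and other tricks, so read the numbers as a direction rather than a measurement.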

Running diffusion models locally usually means one of a few ecosystems rather than the GGUF/llama.cpp stack used for text. The most common are ComfyUI, a node-based interface that gives fine control over the whole generation pipeline, and libraries like Diffusers for a code-first workflow. These load the model (typically in safetensors form) and expose the diffusion parameters directly. It's a different toolchain from the text-model world, so the runners and file formats you learned for LLMs don't transfer one-to-one — but the underlying hardware question is the same: does it fit in your VRAM, and how long does each generation take.
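
For the code-first route, a minimal sketch with the Diffusers library might look like the following. It assumes a Stable Diffusion-style pipeline that accepts these argument names (not every pipeline takes a negative prompt), a CUDA GPU, and that `torch` and `diffusers` are installed. The helper names, default values, and the model id in the usage comment are placeholders for this example, not recommendations.

```python
def generation_settings(prompt, negative_prompt="", steps=30, guidance_scale=7.0):
    """Collect the diffusion knobs under the argument names that
    Stable Diffusion-style Diffusers pipelines commonly accept."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate(model_id, settings, seed, device="cuda"):
    """Load a pipeline and produce one image with a fixed seed."""
    # Imported lazily so the settings helper works without the heavy dependencies.
    import torch
    from diffusers import DiffusionPipeline

    # Half precision is a common way to reduce VRAM use on GPUs.
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)

    # A seeded generator fixes the initial noise for reproducibility.
    generator = torch.Generator(device=device).manual_seed(seed)
    return pipe(**settings, generator=generator).images[0]


settings = generation_settings(
    "a red bicycle leaning against a brick wall",
    negative_prompt="blurry, low quality",
    steps=20,
)

# Usage (downloads weights and needs a GPU, so it is left commented out;
# "your-org/your-model" is a placeholder, not a real repository):
# image = generate("your-org/your-model", settings, seed=1234)
# image.save("bicycle.png")
```

ComfyUI exposes the same parameters as fields on its nodes, so the settings translate even though the interface is different.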

For planning and cost, the key mental shift is that diffusion models are a different modality from text LLMs — they output pixels, not tokens. So their cost and hardware fit are measured per image, not per token, and driven by different factors: the model size, the output resolution, and the number of steps all affect how much VRAM each image needs and how long it takes. Some open image models are also license-restricted (for example, certain FLUX variants are non-commercial while others are permissively licensed), which matters if you plan to use the output commercially. In Spanvero's catalog these fall under media (image and video) models rather than text LLMs, with run cost reported per generation. The specific task of turning words into pictures is covered in the text-to-image explainer, and the audio counterparts in TTS / ASR.
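
Per-image cost on a rented GPU can be estimated with simple arithmetic. The sketch below uses hypothetical inputs throughout (an hourly GPU price of 0.72, 0.25 seconds per denoising step, half a second of fixed overhead); measure your own timings rather than relying on these.

```python
def seconds_per_image(steps, seconds_per_step, overhead_seconds=0.0):
    """Generation time: per-step denoising time plus fixed overhead
    such as text encoding and decoding the latent to pixels."""
    return steps * seconds_per_step + overhead_seconds


def cost_per_image(gpu_price_per_hour, seconds):
    """Share of the hourly rental consumed by one generation."""
    return gpu_price_per_hour * seconds / 3600.0


def images_per_hour(seconds):
    """Throughput if the GPU generates back to back."""
    return 3600.0 / seconds


# Hypothetical numbers for illustration only.
duration = seconds_per_image(steps=30, seconds_per_step=0.25, overhead_seconds=0.5)
cost = cost_per_image(gpu_price_per_hour=0.72, seconds=duration)
throughput = images_per_hour(duration)
```

With these inputs one image takes 8 seconds and costs 0.0016 in whatever currency the hourly price is quoted in. Seconds per step is the term that model size and output resolution move, and the step count multiplies it.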

