Diffusion model

The dominant approach to AI image (and video) generation: the model learns to turn random noise into a coherent image by removing noise step by step, guided by your prompt.

A diffusion model is trained by progressively adding noise to real images and learning to reverse that process. To generate, it starts from pure random noise and "denoises" over many steps until a clean image emerges, steered by a text prompt. Stable Diffusion and similar open models work this way.
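The noising and denoising idea can be sketched in a few lines. This is a toy illustration, not how any particular model is implemented: it assumes a DDPM-style linear noise schedule, treats a short list of numbers as the "image", and uses the true noise in place of a trained network's prediction. The function names and default schedule values are chosen here for illustration.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule.

    alpha_bars[t] is the product of (1 - beta) over steps 0..t; it shrinks
    toward 0 as t grows, meaning less of the original image survives.
    The default beta range is an assumed, commonly used setting.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Forward process: blend a clean signal with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    xt = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise


def estimate_clean(xt, predicted_noise, alpha_bar):
    """Invert the blend, given an estimate of the noise that was added.

    In a real model the estimate comes from a trained neural network
    conditioned on the noisy input, the step, and the text prompt.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - noise_scale * n) / signal_scale for x, n in zip(xt, predicted_noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = make_alpha_bars(1000)
    clean = [0.5, -0.25, 1.0]  # stand-in for pixel values
    noisy, true_noise = add_noise(clean, alpha_bars[500], rng)
    # With a perfect noise prediction the clean signal is recovered.
    print(estimate_clean(noisy, true_noise, alpha_bars[500]))
```

Training amounts to teaching a network to produce the noise estimate from the noisy input alone. Generation then applies a denoising update repeatedly, starting from pure noise, rather than jumping to the clean image in a single step as this toy does.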

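More denoising steps generally mean higher quality but slower generation, with diminishing returns past a certain point. Many modern systems are latent diffusion models, which run the denoising process in a compressed latent space and decode the result to pixels at the end, which is much faster than working on full-resolution pixels. They are commonly customized with LoRA adapters, which add specific styles or subjects without retraining the whole model.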
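To show why a LoRA adapter is cheap compared with retraining, here is a minimal sketch of the low-rank update it applies to one weight matrix: the frozen weight W is adjusted by a scaled product of two small matrices, B and A. The helper names, the tiny matrices, and the 768-wide layer in the usage lines are illustrative assumptions, not taken from any specific model or library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_merge(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) without modifying W.

    weight is d_out x d_in, lora_b is d_out x rank, lora_a is rank x d_in.
    """
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def lora_param_count(d_out, d_in, rank):
    """Trainable values in the adapter for one weight matrix."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 weight
    b = [[1.0], [2.0]]                          # 2 x 1 (rank 1)
    a = [[1.0, 1.0, 1.0]]                       # 1 x 3
    print(lora_merge(base, b, a, alpha=2.0, rank=1))

    # Hypothetical 768 x 768 layer: adapter size versus the full matrix.
    print(lora_param_count(768, 768, rank=4), "vs", 768 * 768)
```

Because only B and A are trained, the adapter stores a small fraction of the values in the matrix it modifies, which is why LoRA files are small and can be swapped in and out of the same base model.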

Diffusion models work in a different modality from text LLMs (they output pixels, not tokens), so their hardware needs and "run cost" are estimated differently. In Spanvero's catalog they fall under media (image/video) models rather than text LLMs.

Related

Text-to-image · LoRA · TTS / ASR (text-to-speech & speech recognition) · VRAM

© 2026 Cynosure LLC.