Generating an image from a written description (a prompt); today this is almost always done with diffusion models.
Text-to-image is the task of producing a picture from words — you type "a watercolor fox in a snowy forest" and the model renders it. The leading open approach is diffusion models, which iteratively denoise random noise into an image conditioned on your prompt.
Results are shaped by the prompt, optional negative prompts (things to avoid), the number of steps, a guidance scale that controls how strictly it follows the prompt, and a random seed that makes runs reproducible. Output resolution and step count drive how much VRAM and time each image takes.
Text-to-image models are part of Spanvero's media catalog (separate from text LLMs), where run-cost and hardware fit are reported per-image rather than per-token.
Diffusion model · LoRA · VRAM · TTS / ASR (text-to-speech & speech recognition)
All explainers → · Browse models →
Open the free Spanvero advisor → · Honest, $0-markup. © 2026 Cynosure LLC.