TTS (text-to-speech) turns written text into spoken audio; ASR (automatic speech recognition) does the reverse, transcribing speech into text.
These are the two main voice modalities. TTS takes text and synthesizes natural-sounding speech; some open models can also clone a voice or speak multiple languages. ASR, also called speech-to-text, takes audio and writes down what was said; Whisper is the best-known open ASR family.
Like image models, voice models are a separate category from text LLMs, so they're judged and priced differently: TTS cost often scales with the number of characters synthesized or the seconds of audio produced, and ASR cost with the minutes of audio transcribed. Many run comfortably on modest hardware, and some run in real time, meaning they process audio at least as fast as it plays.
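To make the pricing units concrete, here is a minimal Python sketch of the arithmetic. The rates in `VoiceRates` are hypothetical placeholders chosen for illustration, not real prices from Spanvero or any provider, and the function names are invented for this example. It also computes a real-time factor (processing time divided by audio duration), a common way to express whether a model keeps up with live audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceRates:
    """Hypothetical placeholder prices, not quotes from any real provider."""
    tts_usd_per_million_chars: float = 15.0
    asr_usd_per_minute: float = 0.006


def tts_cost(text: str, rates: VoiceRates) -> float:
    """Character-based TTS pricing: cost grows with the length of the input text."""
    return len(text) * rates.tts_usd_per_million_chars / 1_000_000


def asr_cost(audio_seconds: float, rates: VoiceRates) -> float:
    """Minute-based ASR pricing: cost grows with the duration of the audio."""
    return (audio_seconds / 60.0) * rates.asr_usd_per_minute


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    rates = VoiceRates()
    script = "Hello, and welcome to the show. " * 100
    print(f"TTS for {len(script)} characters: ${tts_cost(script, rates):.4f}")
    print(f"ASR for a 30-minute recording: ${asr_cost(30 * 60, rates):.4f}")
    print(f"RTF for 12 s of compute on 60 s of audio: {real_time_factor(12, 60):.2f}")
```

Swap in the actual rates for the model you are comparing. For a self-hosted model there is no per-character or per-minute fee, so the real-time factor combined with the hourly cost of the hardware is the more useful number.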
In Spanvero these sit in the media/audio part of the catalog alongside image and video models, so you can filter to voice models specifically and compare their objective run costs.
© 2026 Cynosure LLC.