
TTS / ASR (text-to-speech & speech recognition)

TTS (text-to-speech) turns written text into spoken audio; ASR (automatic speech recognition) does the reverse, transcribing speech into text.

TTS and ASR are the two main voice modalities in AI, and they mirror each other. TTS takes written text and synthesizes natural-sounding speech; ASR, also called speech-to-text, takes audio and writes down what was said. Together they cover the "speaking" and "hearing" ends of a voice application, and thanks to strong open models, both can run on modest hardware, often in real time. They're a distinct modality from text LLMs and from image diffusion models, so they're run and priced differently: by the amount of audio or text processed rather than by the token.

On the TTS side, modern open models produce speech that ranges from clearly synthetic to strikingly human, and many support extra capabilities: multiple languages, control over tone and pacing, and voice cloning (generating speech in a target voice from a short sample). Open TTS models vary widely in size and quality — some small ones run in real time on a CPU, while more expressive models want a GPU. This is the technology behind narration, voice assistants, accessibility tools, and audio content generation. A practical caveat worth flagging: some capable open voice-cloning models are released under non-commercial licenses, so check the license before using them in a product.

On the ASR side, the best-known open family is OpenAI's Whisper, which set the bar for open transcription and comes in several sizes trading accuracy for speed. There's now a healthy ecosystem of alternatives and derivatives — Distil-Whisper (a faster, distilled version), NVIDIA's Parakeet and Canary models, Moonshine, and others — that push speed, accuracy, or efficiency further. ASR is what powers transcription, subtitles, voice commands, meeting notes, and call analysis. Many ASR models are small enough to run comfortably on a modest GPU, and some transcribe faster than real time even on a CPU. Accuracy varies by language, audio quality, accents, and background noise, so the right model depends on your specific audio.
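"Faster than real time" is usually expressed as a real-time factor (RTF): processing time divided by audio duration, where anything below 1.0 means the model finishes before the recording would finish playing. Here is a minimal sketch of that arithmetic. The function names and the timing numbers are made up for illustration; they are not measurements of any model named above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when transcription finishes before the audio would finish playing."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Hypothetical measurement: a 10-minute recording transcribed in 2 minutes.
rtf = real_time_factor(processing_seconds=120.0, audio_seconds=600.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.20
```

To get a number that means something for your setup, time a real transcription run on your own hardware with a recording typical of your audio.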

For planning and cost, the key point is that these are audio models, and they're evaluated and priced differently from text LLMs. Cost isn't measured per token. For TTS, it typically scales with the amount of input text or the seconds of audio produced; for ASR, it scales with the minutes (or hours) of audio transcribed. When comparing voice models, the things that matter are language support, real-time capability, and VRAM footprint, plus voice quality and cloning ability for TTS, or transcription accuracy on audio like yours for ASR. As with image models, we don't quote quality or word-error-rate benchmarks we didn't run; we list the recognized open models with their honest hardware and license facts and let you judge the fit for your use.
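To make the billing units concrete, here is a small sketch of the arithmetic, assuming a hosted TTS service billed per million characters and a hosted ASR service billed per audio minute. Both rates are placeholder numbers chosen for easy math, not real prices from any provider, and the function names are hypothetical.

```python
def tts_cost(characters: int, price_per_million_chars: float) -> float:
    """Cost of synthesizing `characters` of text at a per-million-character rate."""
    return characters / 1_000_000 * price_per_million_chars


def asr_cost(audio_seconds: float, price_per_minute: float) -> float:
    """Cost of transcribing `audio_seconds` of audio at a per-minute rate."""
    return audio_seconds / 60 * price_per_minute


# Placeholder rates (assumptions for illustration only).
narration = tts_cost(characters=50_000, price_per_million_chars=10.0)
meeting = asr_cost(audio_seconds=90 * 60, price_per_minute=0.01)

print(f"TTS, 50,000 characters: ${narration:.2f}")  # $0.50
print(f"ASR, 90 minutes:        ${meeting:.2f}")    # $0.90
```

Tokens never appear in either formula; the only inputs are how much text goes in or how much audio is processed.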

A practical reason voice models are attractive to self-host is that they're often light. Compared with large language models or video generators, many TTS and ASR models are small — a good ASR model can transcribe faster than real time on a modest GPU or even a CPU, and several TTS models run comfortably on everyday hardware. That makes the "$0 on your own machine" route genuinely realistic for voice work, and it keeps sensitive audio (calls, meetings, personal recordings) on your own device rather than sending it to a third party, which is a real privacy advantage for this modality in particular.

A few honest considerations apply when picking a voice model. For ASR, accuracy depends heavily on your specific conditions: language, accent, domain vocabulary, and background noise all matter, so the "best" model for clean English podcast audio may not be the best for noisy multilingual phone calls. Testing on audio like yours beats trusting a single benchmark. For TTS, the trade-offs are naturalness, speed, language coverage, and whether voice cloning is supported. The license matters too, since several capable open voice-cloning models are released for non-commercial use only. As with image and video models, we surface the recognized options with their real hardware and license facts and leave the quality judgment to you, rather than quoting benchmarks we didn't run.
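If you want to test ASR models on your own audio, the usual yardstick is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model's output into a reference transcript, divided by the number of reference words. Here is a minimal sketch of that calculation. It is a simplification: it only lowercases and splits on whitespace, whereas real evaluations usually also normalize punctuation and numbers, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word and one dropped word out of seven.
reference = "turn the lights off in the kitchen"
hypothesis = "turn the light off in kitchen"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")  # WER = 0.286
```

Running the same handful of your own recordings through each candidate model and comparing the WER gives a like-for-like number for your conditions.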

In Spanvero, these sit in the media/audio part of the catalog, alongside image and video models (the image side is covered in the text-to-image and diffusion model explainers), so you can filter to voice models specifically and compare their objective run costs. The recognized open TTS and voice-cloning models are at /best/best-open-text-to-speech-models/, and the open speech-to-text models and Whisper alternatives are at /best/best-open-speech-to-text-models/. You can browse the broader media catalog under /models/ and use the calculator at /calculator/ to compare the honest, $0-markup cost of running a voice model locally versus on a rented GPU.
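For a rough sense of the rented-GPU side of that comparison, one back-of-the-envelope approach is to multiply the hours of audio by the model's real-time factor (processing time divided by audio duration) to get GPU hours, then multiply by the hourly rental price. The sketch below assumes exactly that. It is not a description of how the Spanvero calculator works, and the function name, workload, real-time factor, and hourly price are all hypothetical.

```python
def rented_gpu_cost(audio_hours: float, rtf: float, gpu_price_per_hour: float) -> float:
    """Estimate rental cost for transcribing `audio_hours` of audio.

    rtf is the real-time factor: processing time / audio duration.
    Ignores startup time, idle time, and minimum billing increments.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    gpu_hours = audio_hours * rtf
    return gpu_hours * gpu_price_per_hour


# Hypothetical workload: 100 hours of audio, a model that runs at 10x real time
# (rtf = 0.1), on a GPU rented at a placeholder $0.50 per hour.
cost = rented_gpu_cost(audio_hours=100.0, rtf=0.1, gpu_price_per_hour=0.50)
print(f"Estimated rental cost: ${cost:.2f}")  # Estimated rental cost: $5.00
```

The local route has no rental line at all, so the comparison mostly comes down to whether your own hardware can run the model at an acceptable speed.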


