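Open-weight text-to-speech reached near-human voice quality in 2026. F5-TTS, XTTS v2, Kokoro, OpenVoice v2, ChatTTS, Orpheus, and Fish Speech cover most production TTS use cases, including voice cloning, multilingual synthesis, and emotion conditioning. On raw audio quality the gap to ElevenLabs and OpenAI TTS has largely closed: the best open-weight MOS scores on this page sit within about 0.1 to 0.2 points of the proprietary leaders. Self-hosting open weights also cuts unit cost sharply. Going by the pricing table on this page, the reduction is roughly 40x to 330x against ElevenLabs list prices and 6x to 60x against OpenAI list prices. This page consolidates the leaderboard, the latency profile, and the deployment guidance.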
Key Findings
- F5-TTS is the leading open-weight zero-shot voice cloning model: a 3-second reference sample produces convincing same-voice synthesis with strong naturalness scores.
- Kokoro (released late 2024 by hexgrad) is the most cost-efficient production TTS option listed here: at approximately 82M parameters, its English quality (MOS about 4.2) is competitive with ElevenLabs (about 4.5) at a fraction of the latency and cost.
- OpenVoice v2 from MyShell remains the dominant open-weight voice cloning solution for multilingual deployments, with strong support for English, Chinese, Japanese, Korean, Spanish, and French.
- XTTS v2 from Coqui remains the most-downloaded open-weight TTS model on Hugging Face. Its weights are released under the Coqui Public Model License, which restricts use to non-commercial purposes, so it is not a permissive option for commercial deployment.
- Orpheus (released in early 2025 by Canopy Labs) and Fish Speech are the newer entrants, with strong emotional expressiveness and natural prosody.
Open-Weight TTS Model Comparison (May 2026)
| Model | Parameters | Capability | License |
| F5-TTS | ~330M | Zero-shot voice cloning; multilingual | CC-BY-NC + Commercial Exception |
| E2-TTS | ~330M | Zero-shot voice cloning | CC-BY-NC + Commercial Exception |
| Kokoro v1 | ~82M | English TTS, fast inference | Apache 2.0 |
| OpenVoice v2 | Varies | Voice cloning, multilingual | MIT |
| XTTS v2 | ~480M | Voice cloning, 17 languages | Coqui Public Model License (non-commercial) |
| ChatTTS | Varies | Conversational TTS | CC-BY-NC + Commercial |
| Orpheus | ~3B | Expressive TTS with emotion tags | Apache 2.0 |
| Fish Speech v1.5 | Varies | Voice cloning, multilingual | CC-BY-NC + Commercial |
| StyleTTS 2 | ~150M | Style-controllable TTS | MIT |
| MetaVoice 1B | ~1.2B | Voice cloning (English) | Apache 2.0 |
| Spark-TTS | Varies | Open multilingual TTS | Apache 2.0 |
| Bark | Varies | Multilingual TTS with non-speech sounds | MIT (with restrictions) |
Quality and Latency Profile
| Model | MOS Naturalness | Latency to First Audio (single A100) |
| F5-TTS | ~4.3 | ~280 ms |
| OpenVoice v2 | ~4.1 | ~210 ms |
| XTTS v2 | ~4.0 | ~310 ms |
| Kokoro v1 | ~4.2 | ~80 ms |
| Orpheus | ~4.4 | ~410 ms |
| Fish Speech v1.5 | ~4.2 | ~270 ms |
| ChatTTS | ~4.0 | ~250 ms |
| StyleTTS 2 | ~4.1 | ~150 ms |
| ElevenLabs Turbo v3 (proprietary) | ~4.5 | ~150 ms (API) |
| OpenAI TTS-1-HD (proprietary) | ~4.4 | ~190 ms (API) |
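The latency column above is time to first audio. As a rough illustration of how such a figure could be measured, here is a minimal Python sketch. It assumes the engine exposes a streaming call that yields audio chunks; `fake_streaming_tts` is a hypothetical stand-in rather than any real model's API, and the function names are illustrative only.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_streaming_tts(text: str, chunk_chars: int = 20,
                       delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine (not a real API).

    Yields one chunk of silent 16-bit PCM per `chunk_chars` of input text.
    """
    for _ in range(0, len(text), chunk_chars):
        time.sleep(delay_s)          # simulate per-chunk synthesis work
        yield b"\x00\x00" * 240      # 240 silent samples


def time_to_first_audio_ms(synthesize: Callable[[str], Iterable[bytes]],
                           text: str) -> float:
    """Milliseconds from issuing the request to the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("synthesizer produced no audio")


def median_latency_ms(synthesize: Callable[[str], Iterable[bytes]],
                      text: str, runs: int = 5) -> float:
    """Median over several runs, to damp warm-up and scheduling noise."""
    samples = sorted(time_to_first_audio_ms(synthesize, text)
                     for _ in range(runs))
    return samples[len(samples) // 2]


if __name__ == "__main__":
    prompt = "x" * 100  # 100-character input, batch size 1
    print(f"{median_latency_ms(fake_streaming_tts, prompt):.1f} ms")
```

Swapping `fake_streaming_tts` for a wrapper around a real engine would give comparable numbers only if hardware, batch size, and input length are held fixed across models.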
Multilingual Coverage
| Model | Languages |
| XTTS v2 | 17 (EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO, HI) |
| OpenVoice v2 | ~6 strong (EN, ZH, JA, KO, ES, FR) |
| F5-TTS | ~7 (EN, ZH, plus emerging community variants) |
| Bark | ~14 |
| Kokoro | English (additional voices in dev) |
| Orpheus | English (extension languages emerging) |
| Fish Speech v1.5 | ~9 (EN, ZH, JA, KO, ES, FR, DE, AR, and others) |
Use Case Recommendations
| Use Case | Recommended Model |
| Voice cloning from short reference | F5-TTS or OpenVoice v2 |
| High-volume English-only production | Kokoro v1 |
| Multilingual commercial deployment | OpenVoice v2 (MIT); XTTS v2 only where its non-commercial license terms allow |
| Expressive emotion-tagged audio | Orpheus or Fish Speech |
| Conversational AI agents | ChatTTS or Orpheus |
| Audiobook narration | F5-TTS or XTTS v2 |
| Real-time voice interface | Kokoro or StyleTTS 2 |
| Permissive commercial license | Kokoro, OpenVoice v2, Orpheus, MetaVoice (Apache or MIT) |
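The recommendations above can be approximated mechanically by filtering on license and latency and then ranking by MOS. The sketch below does that in Python using values copied from the comparison and latency tables on this page. The `Model` record, the `PERMISSIVE` set, and the `shortlist` helper are illustrative names, and treating only Apache 2.0 and MIT as permissive is an assumption of this sketch.

```python
from typing import List, NamedTuple, Optional


class Model(NamedTuple):
    name: str
    license: str
    mos: float            # MOS naturalness from the latency table
    first_audio_ms: int   # latency to first audio, single A100


# Values transcribed from the tables on this page.
MODELS = [
    Model("F5-TTS", "CC-BY-NC + Commercial Exception", 4.3, 280),
    Model("OpenVoice v2", "MIT", 4.1, 210),
    Model("XTTS v2", "Coqui Public Model License", 4.0, 310),
    Model("Kokoro v1", "Apache 2.0", 4.2, 80),
    Model("Orpheus", "Apache 2.0", 4.4, 410),
    Model("Fish Speech v1.5", "CC-BY-NC + Commercial", 4.2, 270),
    Model("ChatTTS", "CC-BY-NC + Commercial", 4.0, 250),
    Model("StyleTTS 2", "MIT", 4.1, 150),
]

# Assumption: only these two licenses count as permissive here.
PERMISSIVE = {"Apache 2.0", "MIT"}


def shortlist(max_first_audio_ms: Optional[int] = None,
              permissive_only: bool = False) -> List[str]:
    """Names of matching models, best MOS first, then lowest latency."""
    picked = [
        m for m in MODELS
        if (max_first_audio_ms is None
            or m.first_audio_ms <= max_first_audio_ms)
        and (not permissive_only or m.license in PERMISSIVE)
    ]
    picked.sort(key=lambda m: (-m.mos, m.first_audio_ms))
    return [m.name for m in picked]


if __name__ == "__main__":
    # Real-time interface under a permissive license, 200 ms budget.
    print(shortlist(max_first_audio_ms=200, permissive_only=True))
```

With a 200 ms budget and the permissive filter, this returns Kokoro v1 and StyleTTS 2, matching the real-time row of the table. Language coverage and voice cloning support are not modelled and would need extra fields.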
Open vs Proprietary TTS Pricing
| Provider | Cost per 1M characters |
| ElevenLabs Pro | ~$165 |
| ElevenLabs Scale | ~$110 |
| OpenAI TTS-1-HD | $30 |
| OpenAI TTS-1 | $15 |
| Google Cloud TTS Standard | $4 |
| Azure TTS Neural | $16 |
| Kokoro v1 (self-hosted) | ~$0.50 effective |
| F5-TTS (self-hosted) | ~$2 effective |
| XTTS v2 (self-hosted) | ~$2.50 effective |
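To make the comparison concrete, the following Python sketch computes cost ratios and monthly savings directly from the figures in the table above. The dictionary and function names are illustrative, and the approximate ("~") table values are treated as exact for the arithmetic.

```python
# Prices in USD per 1M characters, transcribed from the table above.
API_PRICE = {
    "ElevenLabs Pro": 165.0,
    "ElevenLabs Scale": 110.0,
    "OpenAI TTS-1-HD": 30.0,
    "OpenAI TTS-1": 15.0,
    "Google Cloud TTS Standard": 4.0,
    "Azure TTS Neural": 16.0,
}
SELF_HOSTED_COST = {
    "Kokoro v1": 0.50,
    "F5-TTS": 2.00,
    "XTTS v2": 2.50,
}


def cost_ratio(api: str, model: str) -> float:
    """How many times more the API costs than the self-hosted model."""
    return API_PRICE[api] / SELF_HOSTED_COST[model]


def monthly_saving_usd(api: str, model: str, chars_per_month: int) -> float:
    """Difference in monthly spend at a given character volume."""
    millions = chars_per_month / 1_000_000
    return (API_PRICE[api] - SELF_HOSTED_COST[model]) * millions


if __name__ == "__main__":
    for api in API_PRICE:
        for model in SELF_HOSTED_COST:
            print(f"{api} vs {model}: {cost_ratio(api, model):.1f}x")
```

The ratio depends heavily on the pairing: ElevenLabs Pro against Kokoro v1 works out to 330x, while OpenAI TTS-1 against XTTS v2 is 6x.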
Brand Visibility Implications
TTS adoption accelerated in 2025-2026, driven by voice agents, podcasts, audiobooks, accessibility, and dubbing. Queries put to AI assistants, such as "best text-to-speech 2026", "voice cloning open source", and "Kokoro vs F5-TTS", drive procurement-research traffic. Brands selling voice infrastructure, dubbing platforms, audiobook tools, and accessibility products are therefore increasingly discovered through AI assistants in this category.
Methodology
Benchmark data are compiled from TTS Arena rankings, primary model card disclosures, and community evaluation data. MOS naturalness comes from human rater studies. Latency is measured on a single A100 with batch size 1 and a 100-character input. Cost estimates use list pricing for closed APIs and amortised GPU cost for self-hosted models. Figures are updated quarterly.
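One plausible way to amortise GPU cost into a per-character figure is to divide the hourly GPU price by the number of characters synthesised per hour. The Python sketch below shows that arithmetic. The formula, the `utilization` factor, and every input value are assumptions for illustration and are not the inputs behind the self-hosted figures on this page.

```python
def cost_per_million_chars(gpu_usd_per_hour: float,
                           chars_per_second: float,
                           utilization: float = 1.0) -> float:
    """Amortised GPU cost in USD per 1M synthesised characters.

    gpu_usd_per_hour: rental or amortised ownership cost of one GPU.
    chars_per_second: sustained synthesis throughput on that GPU.
    utilization: fraction of each hour the GPU spends on real traffic.
    """
    if chars_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be > 0 and utilization in (0, 1]")
    chars_per_hour = chars_per_second * 3600 * utilization
    return gpu_usd_per_hour / chars_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: $2.00/hour GPU, 500 chars/s, 50% utilised.
    print(f"${cost_per_million_chars(2.0, 500, 0.5):.2f} per 1M characters")
```

Utilisation is the term most likely to dominate in practice: a GPU that idles half the time doubles the effective cost per character.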
How Presenc AI Helps
Presenc AI monitors brand visibility on TTS queries across ChatGPT, Claude, Gemini, and Perplexity. For voice infrastructure brands, dubbing platforms, audiobook tools, and accessibility-product vendors, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.