Best Open-Weight Text-to-Speech Models 2026

Open-weight TTS leaderboard 2026: F5-TTS, XTTS v2, Kokoro, OpenVoice v2, ChatTTS, Orpheus, Fish Speech. Voice cloning quality, multilingual, latency, license analysis.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Open-weight text-to-speech reached near-human voice quality in 2026. F5-TTS, XTTS v2, Kokoro, OpenVoice v2, ChatTTS, Orpheus, and Fish Speech cover most production TTS use cases with voice cloning, multilingual synthesis, and emotion conditioning. On raw audio quality, the strongest open-weight models now land within roughly 0.1 to 0.3 MOS of ElevenLabs and OpenAI TTS, while self-hosting reduces unit cost by an estimated 50x to 200x against premium API list pricing. This page consolidates the leaderboard, the latency profile, and the deployment guidance.

Key Findings

  1. F5-TTS is the leading open-weight zero-shot voice cloning model: a 3-second reference sample produces convincing same-voice synthesis with strong naturalness scores.
  2. Kokoro (released late 2024 by hexgrad) is the most cost-efficient production TTS model at approximately 82M parameters, with English quality competitive with ElevenLabs at a fraction of the latency and cost.
  3. OpenVoice v2 from MyShell remains the dominant open-weight voice cloning solution for multilingual deployments, with strong support across English, Chinese, Japanese, Korean, Spanish, and French.
  4. XTTS v2 from Coqui remains the most-downloaded open-weight TTS model on Hugging Face. It ships under the Coqui Public Model License, which restricts commercial use, so review the terms before any commercial deployment.
  5. Orpheus (released in early 2025 by Canopy Labs) and Fish Speech are the newer entrants with strong emotional expressiveness and natural prosody.

Open-Weight TTS Model Comparison (May 2026)

| Model | Parameters | Capability | License |
| --- | --- | --- | --- |
| F5-TTS | ~330M | Zero-shot voice cloning; multilingual | CC-BY-NC + Commercial Exception |
| E2-TTS | ~330M | Zero-shot voice cloning | CC-BY-NC + Commercial Exception |
| Kokoro v1 | ~82M | English TTS, fast inference | Apache 2.0 |
| OpenVoice v2 | varies | Voice cloning, multilingual | MIT |
| XTTS v2 | ~480M | Voice cloning, 17 languages | Coqui Public Model License |
| ChatTTS | varies | Conversational TTS | CC-BY-NC + Commercial |
| Orpheus | ~3B | Expressive TTS with emotion tags | Apache 2.0 |
| Fish Speech v1.5 | varies | Voice cloning, multilingual | CC-BY-NC + Commercial |
| StyleTTS 2 | ~150M | Style-controllable TTS | MIT |
| MetaVoice 1B | ~1.2B | Voice cloning, English | Apache 2.0 |
| Spark-TTS | varies | Open multilingual TTS | Apache 2.0 |
| Bark | varies | Multilingual TTS with non-speech sounds | MIT (with restrictions) |

Quality and Latency Profile

| Model | MOS Naturalness | Latency to First Audio (single A100) |
| --- | --- | --- |
| F5-TTS | ~4.3 | ~280 ms |
| OpenVoice v2 | ~4.1 | ~210 ms |
| XTTS v2 | ~4.0 | ~310 ms |
| Kokoro v1 | ~4.2 | ~80 ms |
| Orpheus | ~4.4 | ~410 ms |
| Fish Speech v1.5 | ~4.2 | ~270 ms |
| ChatTTS | ~4.0 | ~250 ms |
| StyleTTS 2 | ~4.1 | ~150 ms |
| ElevenLabs Turbo v3 (proprietary) | ~4.5 | ~150 ms (API) |
| OpenAI TTS-1-HD (proprietary) | ~4.4 | ~190 ms (API) |
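
One way to read the table above is to filter by a latency budget first and rank by MOS second. The sketch below hard-codes the open-weight rows from the table. The `pick_models` helper and the 200 ms budget are illustrative assumptions, not part of any benchmark tooling.

```python
# Open-weight rows from the table above: (model, approx. MOS, approx. latency in ms).
MODELS = [
    ("F5-TTS", 4.3, 280),
    ("OpenVoice v2", 4.1, 210),
    ("XTTS v2", 4.0, 310),
    ("Kokoro v1", 4.2, 80),
    ("Orpheus", 4.4, 410),
    ("Fish Speech v1.5", 4.2, 270),
    ("ChatTTS", 4.0, 250),
    ("StyleTTS 2", 4.1, 150),
]


def pick_models(budget_ms):
    """Return models within the latency budget, highest MOS first (hypothetical helper)."""
    fits = [m for m in MODELS if m[2] <= budget_ms]
    # Sort by MOS descending, then by latency ascending to break ties.
    return sorted(fits, key=lambda m: (-m[1], m[2]))


if __name__ == "__main__":
    # Example: a 200 ms time-to-first-audio budget (assumed, not from the source).
    for name, mos, latency in pick_models(200):
        print(f"{name}: MOS ~{mos}, ~{latency} ms")
```

With a 200 ms budget only Kokoro v1 and StyleTTS 2 qualify, which matches the real-time recommendation elsewhere on this page. Raising the budget past 410 ms puts Orpheus first on MOS.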

Multilingual Coverage

| Model | Languages |
| --- | --- |
| XTTS v2 | 17 (EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO, HI) |
| OpenVoice v2 | ~6 strong (EN, ZH, JA, KO, ES, FR) |
| F5-TTS | ~7 (EN, ZH, plus emerging community variants) |
| Bark | ~14 multilingual |
| Kokoro | English (additional voices in development) |
| Orpheus | English (extension languages emerging) |
| Fish Speech v1.5 | ~9 (EN, ZH, JA, KO, ES, FR, DE, AR, plus others) |

Use Case Recommendations

| Use Case | Recommended Model |
| --- | --- |
| Voice cloning from short reference | F5-TTS or OpenVoice v2 |
| High-volume English-only production | Kokoro v1 |
| Multilingual commercial deployment | OpenVoice v2 or XTTS v2 (subject to Coqui Public Model License terms) |
| Expressive emotion-tagged audio | Orpheus or Fish Speech |
| Conversational AI agents | ChatTTS or Orpheus |
| Audiobook narration | F5-TTS or XTTS v2 |
| Real-time voice interface | Kokoro or StyleTTS 2 |
| Permissive commercial license | Kokoro, OpenVoice v2, Orpheus, MetaVoice (Apache or MIT) |

Open vs Proprietary TTS Pricing

| Provider | Cost per 1M characters |
| --- | --- |
| ElevenLabs Pro | ~$165 |
| ElevenLabs Scale | ~$110 |
| OpenAI TTS-1-HD | $30 |
| OpenAI TTS-1 | $15 |
| Google Cloud TTS Standard | $4 |
| Azure TTS Neural | $16 |
| Kokoro v1 (self-hosted) | ~$0.50 effective |
| F5-TTS (self-hosted) | ~$2 effective |
| XTTS v2 (self-hosted) | ~$2.50 effective |
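
The cost multiple between an API and a self-hosted model follows directly from the table. The short sketch below computes it for every pairing, using only the figures above. The `cost_ratio` helper is an illustrative name, and the self-hosted numbers remain the page's own amortised estimates.

```python
# USD per 1M characters, copied from the pricing table above.
API_PRICES = {
    "ElevenLabs Pro": 165.0,
    "ElevenLabs Scale": 110.0,
    "OpenAI TTS-1-HD": 30.0,
    "OpenAI TTS-1": 15.0,
    "Azure TTS Neural": 16.0,
    "Google Cloud TTS Standard": 4.0,
}
SELF_HOSTED = {"Kokoro v1": 0.50, "F5-TTS": 2.0, "XTTS v2": 2.50}


def cost_ratio(api: str, model: str) -> float:
    """How many times more the API costs than the self-hosted estimate."""
    return API_PRICES[api] / SELF_HOSTED[model]


if __name__ == "__main__":
    for api in API_PRICES:
        row = ", ".join(
            f"{model}: {cost_ratio(api, model):.1f}x" for model in SELF_HOSTED
        )
        print(f"{api} -> {row}")
```

Under these figures the ratio runs from low single digits against the cheapest cloud tiers to several hundred against ElevenLabs Pro, so any headline multiple depends on which API is taken as the baseline.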

Brand Visibility Implications

TTS adoption accelerated in 2025-2026, driven by voice agents, podcasts, audiobooks, accessibility, and dubbing markets. AI assistant queries about "best text-to-speech 2026", "voice cloning open source", "Kokoro vs F5-TTS", and similar terms drive procurement-research traffic. Brands selling voice infrastructure, dubbing platforms, audiobook tools, and accessibility products therefore face a large AI-mediated discovery surface in this category.

Methodology

Benchmark data is compiled from TTS Arena rankings, primary model card disclosures, and community evaluation data. MOS naturalness scores come from human rater studies. Latency is measured on a single A100 with batch size 1 and a 100-character input. Cost estimates use list pricing for closed APIs; self-hosted figures amortise GPU cost. The page is updated quarterly.
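
For readers who want to reproduce a latency-to-first-audio number on their own hardware, the sketch below shows one way to time it with batch size 1 and a 100-character input. It assumes the model is exposed as a streaming callable that yields audio chunks. `fake_streaming_tts` is a hypothetical stand-in, not any model's real API, and the warm-up and run counts are arbitrary choices.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator

# A 100-character input, matching the methodology above.
TEXT = ("The quick brown fox jumps over the lazy dog. " * 3)[:100]


def fake_streaming_tts(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for a real streaming TTS call; yields audio chunks."""
    time.sleep(0.01)  # simulated compute before the first chunk is ready
    for _ in range(3):
        yield b"\x00" * 480


def time_to_first_audio(synth: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Seconds from issuing the request to receiving the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in synth(text):
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("synthesizer produced no audio")


def median_ttfa_ms(synth, text: str, runs: int = 5, warmup: int = 1) -> float:
    """Median time-to-first-audio in milliseconds after discarding warm-up runs."""
    for _ in range(warmup):
        time_to_first_audio(synth, text)
    samples = [time_to_first_audio(synth, text) for _ in range(runs)]
    return 1000 * statistics.median(samples)


if __name__ == "__main__":
    print(f"median TTFA: {median_ttfa_ms(fake_streaming_tts, TEXT):.1f} ms")
```

Swapping `fake_streaming_tts` for a wrapper around a real model keeps the rest unchanged. Reporting the median rather than the mean reduces the influence of one-off stalls such as first-call kernel compilation.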

How Presenc AI Helps

Presenc AI monitors brand visibility on TTS queries across ChatGPT, Claude, Gemini, and Perplexity. For voice infrastructure brands, dubbing platforms, audiobook tools, and accessibility-product vendors, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.

Frequently Asked Questions

For voice cloning, F5-TTS leads. For English production at scale, Kokoro v1 is the dominant choice with an Apache 2.0 licence. For expressive emotional speech, Orpheus is the strongest. For multilingual deployment, XTTS v2 covers the most languages (17), while OpenVoice v2 (MIT) is the more permissively licensed option for commercial use.
On quality, the gap is small (~0.2 MOS naturalness in favour of ElevenLabs Turbo v3). On cost, self-hosted open-weight TTS is dramatically cheaper at scale (~$0.50 per million characters for Kokoro vs ~$165 for ElevenLabs Pro). The break-even depends on volume; most workloads above 5 million characters per month favour self-hosting.
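
The break-even point can be estimated with simple arithmetic: self-hosting wins once the per-character saving covers the fixed monthly cost of running the model. The sketch below assumes a hypothetical fixed cost of $800 per month for a dedicated GPU plus operations; that figure is an illustration, not a number from this page. The per-million-character prices are the ones quoted above.

```python
def break_even_million_chars(
    fixed_monthly_usd: float,
    api_usd_per_m: float,
    marginal_usd_per_m: float = 0.0,
) -> float:
    """Monthly volume (in millions of characters) where self-hosting matches API cost.

    Solves: api_usd_per_m * V = fixed_monthly_usd + marginal_usd_per_m * V
    """
    saving = api_usd_per_m - marginal_usd_per_m
    if saving <= 0:
        raise ValueError("API is not more expensive per unit; no break-even exists")
    return fixed_monthly_usd / saving


if __name__ == "__main__":
    FIXED = 800.0  # hypothetical monthly GPU + ops cost in USD (assumption)
    for api, price in [("ElevenLabs Pro", 165.0), ("OpenAI TTS-1", 15.0)]:
        volume = break_even_million_chars(FIXED, price, marginal_usd_per_m=0.50)
        print(f"vs {api}: break-even at ~{volume:.1f}M characters/month")
```

Under that assumed fixed cost, the break-even against ElevenLabs Pro falls just under 5 million characters per month, while against OpenAI TTS-1 it is more than ten times higher. The threshold therefore depends heavily on which API is being replaced.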
F5-TTS performs zero-shot voice cloning, meaning a 3-second reference sample is enough to generate same-voice output without per-voice training. The reference is processed alongside the target text in a single forward pass. Quality drops if the reference is shorter than approximately 2 seconds or if the reference audio is noisy.
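
Because quality degrades with very short references, it can help to reject clips below the duration floor before they reach the model. The sketch below checks the length of a PCM WAV reference using only the Python standard library. The 2-second threshold mirrors the approximate figure above, `check_reference` is an illustrative helper rather than part of F5-TTS, and noise detection is not covered.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 2.0  # approximate floor mentioned above; tune per model


def wav_duration_seconds(source) -> float:
    """Duration of a PCM WAV given a file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(source, min_seconds: float = MIN_REFERENCE_SECONDS):
    """Return (is_long_enough, duration_in_seconds) for a reference clip."""
    duration = wav_duration_seconds(source)
    return duration >= min_seconds, duration


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used here only for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for seconds in (1.0, 3.0):
        ok, duration = check_reference(make_silent_wav(seconds))
        print(f"{duration:.1f} s reference -> {'accept' if ok else 'reject'}")
```

In a real pipeline `check_reference` would receive the uploaded file path instead of a synthetic buffer, and a separate signal-quality check would handle noisy recordings.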
Kokoro v1 leads on latency-to-first-audio at approximately 80 ms on a single A100, making it the dominant choice for real-time voice agents. StyleTTS 2 is the second-fastest at approximately 150 ms. For latency budgets tighter than that, check per-model measurements such as those from TTS Arena V2 against your own hardware, or move to a faster GPU.
It depends on your safeguards. Open-weight voice cloning models (F5-TTS, OpenVoice v2, XTTS v2) can clone any voice from a short reference, which raises consent and misuse concerns. Production deployments should include voice provenance verification, consent collection, watermarking, and audit logging. The C2PA content provenance standard and watermarking tools from MIT and ElevenLabs are the emerging industry baseline.
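
Two of the safeguards listed above, consent collection and audit logging, can be sketched as a gate that runs before any cloning request. The example below is a minimal illustration using only the standard library. The field names, the `consent_token` concept, and the `audit_record` helper are all assumptions, and it does not implement watermarking or C2PA manifests.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    reference_audio: bytes,
    speaker_id: str,
    consent_token: Optional[str],
    requester: str,
) -> str:
    """Refuse cloning without consent; otherwise return one JSON audit log line."""
    if not consent_token:
        raise PermissionError(f"no consent on file for speaker {speaker_id!r}")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "speaker_id": speaker_id,
        "requester": requester,
        "consent_token": consent_token,
        # Store a hash of the reference clip rather than the audio itself.
        "reference_sha256": hashlib.sha256(reference_audio).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    line = audit_record(b"fake-audio-bytes", "speaker-42", "consent-abc", "svc-dubbing")
    print(line)  # append this line to an append-only log before synthesis runs
```

Hashing the reference keeps the log useful for later dispute resolution without retaining the voice sample in the audit store.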
