What is the best open-weight TTS model in 2026?

For voice cloning, F5-TTS leads. For English production at scale, Kokoro v1 is the dominant choice and ships under the Apache 2.0 licence. For expressive emotional speech, Orpheus is the strongest option. For multilingual deployment, OpenVoice v2 (MIT) and XTTS v2 cover the most languages; XTTS v2 is distributed under the Coqui Public Model License, so review its terms before commercial use.

Can open-weight TTS replace ElevenLabs?

On quality, the gap is small (~0.2 MOS naturalness in favour of ElevenLabs Turbo v3). On cost, self-hosted open-weight TTS is dramatically cheaper at scale (~$0.50 per million characters for Kokoro vs ~$165 for ElevenLabs Pro). The break-even depends on volume; most workloads above 5 million characters per month favour self-hosting.
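The break-even arithmetic can be sketched as follows. The per-million prices are the figures quoted above; the fixed monthly overhead for self-hosting (`FIXED_SELF`) is an assumption chosen purely for illustration, as are the function names.

```python
def monthly_cost(chars_millions: float, per_million: float, fixed: float = 0.0) -> float:
    """Monthly bill in USD for a volume given in millions of characters."""
    return fixed + chars_millions * per_million


def break_even_millions(api_per_million: float,
                        self_per_million: float,
                        fixed_self_hosted: float) -> float:
    """Volume (millions of characters per month) where both options cost the same."""
    saving_per_million = api_per_million - self_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_self_hosted / saving_per_million


API_PER_M = 165.0    # ElevenLabs Pro, figure quoted above
SELF_PER_M = 0.50    # Kokoro self-hosted, figure quoted above
FIXED_SELF = 800.0   # ASSUMPTION: monthly engineering and ops overhead in USD

volume = break_even_millions(API_PER_M, SELF_PER_M, FIXED_SELF)
print(f"break-even at about {volume:.2f}M characters per month")
```

With a different overhead figure the break-even volume moves proportionally, so it is worth plugging in your own staffing and monitoring costs before deciding.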

How fast is voice cloning with F5-TTS?

F5-TTS performs zero-shot voice cloning: a reference sample of about 3 seconds is enough to generate same-voice output without per-voice training, so cloning adds no setup time beyond normal inference (approximately 280 ms to first audio on a single A100). The reference is processed alongside the target text in a single forward pass. Quality drops if the reference is shorter than approximately 2 seconds or if the reference audio is noisy.
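A pre-flight check on reference length might look like the sketch below. It uses only Python's standard `wave` module and assumes the reference is an uncompressed WAV file; the 2-second floor is the approximate figure quoted above, and the function names are illustrative rather than part of any F5-TTS API. It does not check for noise.

```python
import wave

MIN_REFERENCE_SECONDS = 2.0  # approximate floor quoted above


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True when the reference clip meets the minimum duration."""
    return wav_duration_seconds(path) >= minimum
```

Rejecting short clips before inference is cheaper than discovering degraded output afterwards.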

Which TTS model is best for real-time agents?

Kokoro v1 leads on latency to first audio at approximately 80 ms on a single A100, making it the default choice for real-time voice agents. StyleTTS 2 is second-fastest at approximately 150 ms. If your latency budget is tighter than that, check the per-model results on TTS Arena V2 or consider a GPU upgrade.
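To verify latency to first audio on your own hardware, one approach is to time the first chunk from a streaming synthesis call. The sketch below is engine-agnostic: `fake_tts_stream` is a hypothetical stand-in that sleeps and yields silent bytes, and you would replace it with whatever streaming iterator your TTS engine exposes.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first chunk, iterator that replays every chunk)."""
    iterator = iter(stream)
    start = time.perf_counter()
    first = next(iterator)  # blocks until the engine yields its first audio chunk
    elapsed = time.perf_counter() - start

    def replay() -> Iterator[bytes]:
        yield first          # hand the timed chunk back so no audio is lost
        yield from iterator

    return elapsed, replay()


def fake_tts_stream(text: str, first_chunk_delay: float = 0.08) -> Iterator[bytes]:
    """HYPOTHETICAL engine: waits, then yields three chunks of silence."""
    time.sleep(first_chunk_delay)
    for _ in range(3):
        yield b"\x00" * 320


latency, chunks = time_to_first_chunk(fake_tts_stream("hello"))
print(f"first audio after {latency * 1000:.0f} ms")
```

Measuring at the iterator boundary captures model latency only; network and audio-device buffering add to what the end user perceives.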

Is voice cloning safe to deploy?

It depends on your safeguards. Open-weight voice cloning models (F5-TTS, OpenVoice v2, XTTS v2) can clone any voice from a short reference, which raises consent and misuse concerns. Production deployments should include voice provenance verification, consent collection, watermarking, and audit logging. The C2PA voice provenance standard and watermarking tools from MIT and ElevenLabs are the emerging industry baseline.
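Two of the safeguards above, consent checks and audit logging, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `ConsentRecord` shape, the in-memory consent list, and the JSON log format are all hypothetical, and the sketch does not implement watermarking or C2PA provenance.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class ConsentRecord:
    """HYPOTHETICAL record: a speaker's consent tied to one exact reference clip."""
    speaker_id: str
    reference_sha256: str
    granted_at: str


def fingerprint(audio_bytes: bytes) -> str:
    """SHA-256 of the raw reference audio, used to match it against consent."""
    return hashlib.sha256(audio_bytes).hexdigest()


def authorise_clone(audio_bytes: bytes, speaker_id: str,
                    consents: List[ConsentRecord], audit_log: List[str]) -> bool:
    """Allow cloning only if consent exists for this speaker and this exact clip."""
    digest = fingerprint(audio_bytes)
    allowed = any(
        c.speaker_id == speaker_id and c.reference_sha256 == digest
        for c in consents
    )
    # Log every attempt, allowed or not, so refusals are auditable too.
    audit_log.append(json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "speaker_id": speaker_id,
        "reference_sha256": digest,
        "allowed": allowed,
    }))
    return allowed
```

Binding consent to a hash of the clip means a different recording of the same speaker requires fresh consent, which is a deliberately conservative choice.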

Best Open-Weight Text-to-Speech Models 2026

Open-weight text-to-speech reached near-human voice quality in 2026. F5-TTS, XTTS v2, Kokoro, OpenVoice v2, ChatTTS, Orpheus, and Fish Speech cover most production TTS use cases with voice cloning, multilingual synthesis, and emotion conditioning. The gap to ElevenLabs and OpenAI TTS has nearly closed on raw audio quality (within roughly 0.2 MOS), while self-hosting open-weight models reduces unit cost by 50x to 200x at scale. This page consolidates the leaderboard, the latency profile, and the deployment guidance.

Key Findings

F5-TTS is the leading open-weight zero-shot voice cloning model: a 3-second reference sample produces convincing same-voice synthesis with strong naturalness scores.
Kokoro (released late 2024 by hexgrad) is the most cost-efficient production TTS model: at approximately 82M parameters, its English quality is competitive with ElevenLabs at a fraction of the latency and cost.
OpenVoice v2 from MyShell remains the dominant open-weight voice cloning solution for multilingual deployments, with strong support across English, Chinese, Japanese, Korean, Spanish, and French.
XTTS v2 from Coqui remains the most-downloaded open-weight TTS model on Hugging Face. It is distributed under the Coqui Public Model License, whose terms should be reviewed before any commercial deployment.
Orpheus (released early 2025 by Canopy Labs) and Fish Speech are the newer entrants, with strong emotional expressiveness and natural prosody.

Open-Weight TTS Model Comparison (May 2026)

Model	Parameters	Capability	License
F5-TTS	~330M	Zero-shot voice cloning; multilingual	CC-BY-NC + Commercial Exception
E2-TTS	~330M	Zero-shot voice cloning	CC-BY-NC + Commercial Exception
Kokoro v1	~82M	English TTS, fast inference	Apache 2.0
OpenVoice v2	varies	Voice cloning, multilingual	MIT
XTTS v2	~480M	Voice cloning, 17 languages	Coqui Public Model License
ChatTTS	varies	Conversational TTS	CC-BY-NC + Commercial
Orpheus	~3B	Expressive TTS with emotion tags	Apache 2.0
Fish Speech v1.5	varies	Voice cloning, multilingual	CC-BY-NC + Commercial
StyleTTS 2	~150M	Style-controllable TTS	MIT
MetaVoice 1B	~1.2B	Voice cloning English	Apache 2.0
Spark-TTS	varies	Open multilingual TTS	Apache 2.0
Bark	varies	Multilingual TTS with non-speech	MIT (with restrictions)

Quality and Latency Profile

Model	MOS Naturalness	Latency to First Audio (single A100)
F5-TTS	~4.3	~280 ms
OpenVoice v2	~4.1	~210 ms
XTTS v2	~4.0	~310 ms
Kokoro v1	~4.2	~80 ms
Orpheus	~4.4	~410 ms
Fish Speech v1.5	~4.2	~270 ms
ChatTTS	~4.0	~250 ms
StyleTTS 2	~4.1	~150 ms
ElevenLabs Turbo v3 (proprietary)	~4.5	~150 ms API
OpenAI TTS-1-HD (proprietary)	~4.4	~190 ms API
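Choosing a model under a latency budget amounts to filtering this table and ranking what remains by naturalness. The sketch below copies the approximate open-weight figures from the rows above; the function name and the tie-breaking rule (lower latency wins on equal MOS) are illustrative choices.

```python
# model: (approx. MOS naturalness, approx. latency to first audio in ms)
# Values copied from the open-weight rows of the table above.
PROFILE = {
    "F5-TTS": (4.3, 280),
    "OpenVoice v2": (4.1, 210),
    "XTTS v2": (4.0, 310),
    "Kokoro v1": (4.2, 80),
    "Orpheus": (4.4, 410),
    "Fish Speech v1.5": (4.2, 270),
    "ChatTTS": (4.0, 250),
    "StyleTTS 2": (4.1, 150),
}


def best_within_budget(budget_ms: float, profile=PROFILE):
    """Models at or under the latency budget, best MOS first, then lowest latency."""
    candidates = [
        (name, mos, ms)
        for name, (mos, ms) in profile.items()
        if ms <= budget_ms
    ]
    return sorted(candidates, key=lambda row: (-row[1], row[2]))


for name, mos, ms in best_within_budget(200):
    print(f"{name}: MOS ~{mos}, ~{ms} ms")
```

Because the figures are approximate and measured on a single A100, treat the ranking as a shortlist to benchmark rather than a final answer.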

Multilingual Coverage

Model	Languages
XTTS v2	17 (EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO, HI)
OpenVoice v2	~6 strong (EN, ZH, JA, KO, ES, FR)
F5-TTS	~7 (EN, ZH, plus emerging community variants)
Bark	~14 multilingual
Kokoro	English (additional voices in dev)
Orpheus	English (extension languages emerging)
Fish Speech v1.5	~9 (EN, ZH, JA, KO, ES, FR, DE, AR, plus others)
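Matching a product's language requirements against this table is a subset check. The sketch below includes only the language codes explicitly listed above, so models whose coverage is described as "plus" or "emerging" are under-represented, and Bark is omitted because its languages are not enumerated here.

```python
# Only codes explicitly listed in the table above; partial for F5-TTS and Fish Speech.
COVERAGE = {
    "XTTS v2": {"EN", "ES", "FR", "DE", "IT", "PT", "PL", "TR", "RU",
                "NL", "CS", "AR", "ZH", "JA", "HU", "KO", "HI"},
    "OpenVoice v2": {"EN", "ZH", "JA", "KO", "ES", "FR"},
    "Fish Speech v1.5": {"EN", "ZH", "JA", "KO", "ES", "FR", "DE", "AR"},
    "F5-TTS": {"EN", "ZH"},
    "Kokoro": {"EN"},
    "Orpheus": {"EN"},
}


def models_covering(required, coverage=COVERAGE):
    """Names of models whose listed languages include every required code."""
    needed = {code.upper() for code in required}
    return sorted(name for name, langs in coverage.items() if needed <= langs)


print(models_covering(["en", "ja"]))
print(models_covering(["hi"]))
```

A listed language does not guarantee equal quality across languages, so a short listening test per target language is still advisable.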

Use Case Recommendations

Use Case	Recommended Model
Voice cloning from short reference	F5-TTS or OpenVoice v2
High-volume English-only production	Kokoro v1
Multilingual commercial deployment	OpenVoice v2 or XTTS v2
Expressive emotion-tagged audio	Orpheus or Fish Speech
Conversational AI agents	ChatTTS or Orpheus
Audiobook narration	F5-TTS or XTTS v2
Real-time voice interface	Kokoro or StyleTTS 2
Permissive commercial license	Kokoro, OpenVoice v2, Orpheus, MetaVoice (Apache 2.0 or MIT)

Open vs Proprietary TTS Pricing

Provider	Cost per 1M characters
ElevenLabs Pro	~$165
ElevenLabs Scale	~$110
OpenAI TTS-1-HD	$30
OpenAI TTS-1	$15
Google Cloud TTS Standard	$4
Azure TTS Neural	$16
Kokoro v1 (self-hosted)	~$0.50 effective
F5-TTS (self-hosted)	~$2 effective
XTTS v2 (self-hosted)	~$2.50 effective

Brand Visibility Implications

TTS adoption accelerated in 2025-2026, driven by voice agents, podcasts, audiobooks, accessibility, and dubbing markets. AI assistant queries about "best text-to-speech 2026", "voice cloning open source", "Kokoro vs F5-TTS", and similar terms drive procurement-research traffic. Brands selling voice infrastructure, dubbing platforms, audiobook tools, and accessibility products face a large AI-mediated discovery surface in this category.

Methodology

Benchmark data compiled from TTS Arena rankings, primary model card disclosures, and community evaluation data. MOS naturalness from human rater studies. Latency on single A100 with batch size 1 and 100-character input. Cost estimates: closed APIs at list pricing; self-hosted figures amortise GPU cost. Updated quarterly.
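One plausible way to amortise GPU cost into a per-character figure is shown below. The formula is a general sketch, not necessarily the exact method used for the table; the hourly GPU price and throughput in the example are assumptions picked for illustration, not measured values.

```python
def cost_per_million_chars(gpu_usd_per_hour: float,
                           chars_per_second: float,
                           utilisation: float = 1.0) -> float:
    """Effective USD per 1M characters for a GPU billed by the hour.

    utilisation is the fraction of each billed hour spent synthesising (0 to 1].
    """
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    chars_per_hour = chars_per_second * 3600 * utilisation
    return gpu_usd_per_hour / chars_per_hour * 1_000_000


# ASSUMPTIONS: $1.80 per GPU-hour and 1,000 characters synthesised per second.
print(round(cost_per_million_chars(1.80, 1000), 2))
print(round(cost_per_million_chars(1.80, 1000, utilisation=0.5), 2))
```

Utilisation matters as much as raw throughput: a GPU that sits idle half the time doubles the effective unit cost.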

How Presenc AI Helps

Presenc AI monitors brand visibility on TTS queries across ChatGPT, Claude, Gemini, and Perplexity. For voice infrastructure brands, dubbing platforms, audiobook tools, and accessibility-product vendors, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.