Research

Best Open-Weight Speech-to-Text Models 2026

Open-weight ASR leaderboard 2026: Whisper Large v3 Turbo, NVIDIA Canary 1B, NVIDIA Parakeet TDT, Distil-Whisper, Moonshine. WER benchmarks, latency, multilingual coverage.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Open-weight speech-to-text (ASR) reached production parity with the leading proprietary APIs in 2026. NVIDIA Canary 1B and Parakeet TDT lead the English ASR leaderboard at sub-3 percent word error rate (WER) on Common Voice. Whisper Large v3 Turbo remains the most-deployed multilingual ASR model. Distil-Whisper and Moonshine cover the on-device and edge segment. This page consolidates the leaderboard, the multilingual coverage, the latency profile, and the deployment guidance.

Key Findings

  1. NVIDIA Canary 1B leads the Hugging Face Open ASR leaderboard on English with an average WER of approximately 6.7 percent across the standard benchmark suite.
  2. Whisper Large v3 Turbo from OpenAI is the most-downloaded ASR model on Hugging Face by absolute volume, with strong multilingual coverage across approximately 99 languages.
  3. NVIDIA Parakeet TDT 1.1B achieves approximately 3.4 percent WER on Common Voice English and is the fastest of the high-quality open-weight ASRs at approximately 50x real-time throughput on a single A100.
  4. Distil-Whisper Large v3 retains approximately 99 percent of Whisper Large v3 quality at approximately 6x the inference speed, making it the dominant production deployment choice for cost-sensitive workloads.
  5. Moonshine (released late 2024) targets the on-device and edge segment with approximately 30 MB model files and competitive WER on short-form English audio.

Open-Weight ASR Leaderboard (May 2026)

ModelParametersOpen ASR Avg WERLicense
NVIDIA Canary 1B~1B~6.7%CC-BY-4.0
NVIDIA Parakeet TDT 1.1B~1.1B~6.9%CC-BY-4.0
NVIDIA Canary 1B Flash~1B~7.4% (faster)CC-BY-4.0
Whisper Large v3 Turbo~0.8B~7.8%MIT
Whisper Large v3~1.5B~7.4%MIT
Distil-Whisper Large v3~756M~8.2%MIT
SeamlessM4T v2 Large~2.3B~9.1% (English subset)CC-BY-NC
Moonshine Tiny~27M~13.2%MIT
Moonshine Base~61M~10.4%MIT
Whisper Small / Base / Tiny~244M / 74M / 39M~14-22%MIT

Multilingual Coverage

ModelLanguages SupportedStrongest Languages
Whisper Large v3 Turbo~99English, European languages, Chinese, Japanese
NVIDIA Canary 1B4 (EN, DE, ES, FR)English, with parity on DE/ES/FR
SeamlessM4T v2 Large~100 ASR languagesTranslation-plus-ASR multilingual
Parakeet TDT 1.1BEnglish onlyEnglish at top of leaderboard
Distil-Whisper Large v3~99 (inherits Whisper)English plus Whisper coverage
MoonshineEnglish onlyShort-form English audio

Latency Profile (Real-Time Factor on Single A100)

ModelRTFx (real-time factor)
Parakeet TDT 1.1B~50x
Canary 1B Flash~45x
Distil-Whisper Large v3~25x
Whisper Large v3 Turbo~16x
Canary 1B~14x
Whisper Large v3~8x
Moonshine Tiny (CPU)~12x

Use Case Recommendations

Use CaseRecommended Model
English-only highest qualityNVIDIA Canary 1B or Parakeet TDT 1.1B
Multilingual general purposeWhisper Large v3 Turbo
Cost-sensitive multilingualDistil-Whisper Large v3
On-device / mobileMoonshine Tiny or Base
Translation + ASRSeamlessM4T v2 Large
Low-latency streamingParakeet TDT 1.1B (with VAD)
Long-form podcast / lectureWhisper Large v3 with chunking
Permissive commercial deploymentWhisper family (MIT) or Moonshine (MIT)

Open vs Proprietary ASR Comparison

ProviderApproximate WER (English)Cost (per minute audio)
Deepgram Nova-3~5.4%~$0.0043
AssemblyAI Universal-2~6.3%~$0.0037
OpenAI Whisper API~7.8%~$0.006
Azure Speech~6.0%~$0.017
Google Speech-to-Text v2~6.2%~$0.024
NVIDIA Canary 1B (self-hosted)~6.7%~$0.0008 effective
Distil-Whisper Large v3 (self-hosted)~8.2%~$0.0003 effective

Brand Visibility Implications

ASR is one of the highest-volume enterprise AI workloads in 2026 (call centres, meeting transcription, video captioning, accessibility, voice assistant interfaces). AI assistant queries about "best speech-to-text model", "open-source ASR", "Whisper vs Deepgram", and similar terms drive direct production procurement decisions. Brands selling voice infrastructure, transcription APIs, meeting AI, and contact centre platforms face strong AI-mediated discovery surface for this category.

Methodology

Benchmark data compiled from the Hugging Face Open ASR Leaderboard, primary model card disclosures, and provider pricing pages. Real-time factors measured on single A100 with FP16 inference. Cost estimates: closed APIs at list pricing; self-hosted figures amortise GPU cost across realistic throughput. Updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on ASR and speech-to-text queries across ChatGPT, Claude, Gemini, and Perplexity. For voice infrastructure brands, transcription API vendors, and contact centre platforms, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.

Frequently Asked Questions

NVIDIA Canary 1B leads the Hugging Face Open ASR leaderboard on English at approximately 6.7 percent average WER. NVIDIA Parakeet TDT 1.1B is close behind and the fastest of the top-tier models. For multilingual workloads, Whisper Large v3 Turbo remains the dominant choice covering 99 languages.
Deepgram Nova-3 retains a small WER lead (~5.4 percent vs Canary 1B at ~6.7 percent) plus operational simplicity. NVIDIA Canary 1B self-hosted is dramatically cheaper at scale (~$0.0008 per minute vs Deepgram ~$0.0043). The break-even depends on volume; most workloads above 100,000 minutes per month favour self-hosting.
Distil-Whisper Large v3 retains approximately 99 percent of Whisper Large v3 quality at approximately 6x the inference speed. For most production workloads Distil-Whisper is the better choice. Use Whisper Large v3 directly only when you need the absolute highest multilingual quality or specific languages where Distil-Whisper underperforms.
Moonshine Tiny at approximately 27 MB is the smallest production-grade ASR model and runs on CPU. Whisper Tiny (~39M parameters) is a slightly larger alternative. For mobile and edge deployments, Moonshine is the dominant 2026 choice.
Yes. Parakeet TDT 1.1B is purpose-built for streaming with TDT (token-and-duration transducer) decoding. Whisper with VAD-based chunking works for near-real-time. The Moonshine family is designed for live caption use cases. Canary 1B is more batch-oriented than streaming-friendly.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.