Open-weight speech-to-text (ASR) reached production parity with the leading proprietary APIs in 2026. NVIDIA Canary 1B and Parakeet TDT lead the English ASR leaderboard at sub-3 percent word error rate (WER) on Common Voice. Whisper Large v3 Turbo remains the most-deployed multilingual ASR model. Distil-Whisper and Moonshine cover the on-device and edge segment. This page consolidates the leaderboard, the multilingual coverage, the latency profile, and the deployment guidance.
Key Findings
- NVIDIA Canary 1B leads the Hugging Face Open ASR leaderboard on English with an average WER of approximately 6.7 percent across the standard benchmark suite.
- Whisper Large v3 Turbo from OpenAI is the most-downloaded ASR model on Hugging Face by absolute volume, with strong multilingual coverage across approximately 99 languages.
- NVIDIA Parakeet TDT 1.1B achieves approximately 3.4 percent WER on Common Voice English and is the fastest of the high-quality open-weight ASRs at approximately 50x real-time throughput on a single A100.
- Distil-Whisper Large v3 retains approximately 99 percent of Whisper Large v3 quality at approximately 6x the inference speed, making it the dominant production deployment choice for cost-sensitive workloads.
- Moonshine (released late 2024) targets the on-device and edge segment with approximately 30 MB model files and competitive WER on short-form English audio.
Open-Weight ASR Leaderboard (May 2026)
| Model | Parameters | Open ASR Avg WER | License |
| NVIDIA Canary 1B | ~1B | ~6.7% | CC-BY-4.0 |
| NVIDIA Parakeet TDT 1.1B | ~1.1B | ~6.9% | CC-BY-4.0 |
| NVIDIA Canary 1B Flash | ~1B | ~7.4% (faster) | CC-BY-4.0 |
| Whisper Large v3 Turbo | ~0.8B | ~7.8% | MIT |
| Whisper Large v3 | ~1.5B | ~7.4% | MIT |
| Distil-Whisper Large v3 | ~756M | ~8.2% | MIT |
| SeamlessM4T v2 Large | ~2.3B | ~9.1% (English subset) | CC-BY-NC |
| Moonshine Tiny | ~27M | ~13.2% | MIT |
| Moonshine Base | ~61M | ~10.4% | MIT |
| Whisper Small / Base / Tiny | ~244M / 74M / 39M | ~14-22% | MIT |
Multilingual Coverage
| Model | Languages Supported | Strongest Languages |
| Whisper Large v3 Turbo | ~99 | English, European languages, Chinese, Japanese |
| NVIDIA Canary 1B | 4 (EN, DE, ES, FR) | English, with parity on DE/ES/FR |
| SeamlessM4T v2 Large | ~100 ASR languages | Translation-plus-ASR multilingual |
| Parakeet TDT 1.1B | English only | English at top of leaderboard |
| Distil-Whisper Large v3 | ~99 (inherits Whisper) | English plus Whisper coverage |
| Moonshine | English only | Short-form English audio |
Latency Profile (Real-Time Factor on Single A100)
| Model | RTFx (real-time factor) |
| Parakeet TDT 1.1B | ~50x |
| Canary 1B Flash | ~45x |
| Distil-Whisper Large v3 | ~25x |
| Whisper Large v3 Turbo | ~16x |
| Canary 1B | ~14x |
| Whisper Large v3 | ~8x |
| Moonshine Tiny (CPU) | ~12x |
Use Case Recommendations
| Use Case | Recommended Model |
| English-only highest quality | NVIDIA Canary 1B or Parakeet TDT 1.1B |
| Multilingual general purpose | Whisper Large v3 Turbo |
| Cost-sensitive multilingual | Distil-Whisper Large v3 |
| On-device / mobile | Moonshine Tiny or Base |
| Translation + ASR | SeamlessM4T v2 Large |
| Low-latency streaming | Parakeet TDT 1.1B (with VAD) |
| Long-form podcast / lecture | Whisper Large v3 with chunking |
| Permissive commercial deployment | Whisper family (MIT) or Moonshine (MIT) |
Open vs Proprietary ASR Comparison
| Provider | Approximate WER (English) | Cost (per minute audio) |
| Deepgram Nova-3 | ~5.4% | ~$0.0043 |
| AssemblyAI Universal-2 | ~6.3% | ~$0.0037 |
| OpenAI Whisper API | ~7.8% | ~$0.006 |
| Azure Speech | ~6.0% | ~$0.017 |
| Google Speech-to-Text v2 | ~6.2% | ~$0.024 |
| NVIDIA Canary 1B (self-hosted) | ~6.7% | ~$0.0008 effective |
| Distil-Whisper Large v3 (self-hosted) | ~8.2% | ~$0.0003 effective |
Brand Visibility Implications
ASR is one of the highest-volume enterprise AI workloads in 2026 (call centres, meeting transcription, video captioning, accessibility, voice assistant interfaces). AI assistant queries about "best speech-to-text model", "open-source ASR", "Whisper vs Deepgram", and similar terms drive direct production procurement decisions. Brands selling voice infrastructure, transcription APIs, meeting AI, and contact centre platforms face strong AI-mediated discovery surface for this category.
Methodology
Benchmark data compiled from the Hugging Face Open ASR Leaderboard, primary model card disclosures, and provider pricing pages. Real-time factors measured on single A100 with FP16 inference. Cost estimates: closed APIs at list pricing; self-hosted figures amortise GPU cost across realistic throughput. Updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on ASR and speech-to-text queries across ChatGPT, Claude, Gemini, and Perplexity. For voice infrastructure brands, transcription API vendors, and contact centre platforms, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.