The Voice AI Production Stack in 2026
Voice AI call agents moved from demo-quality to production-deployable in 2024-2025 as latency-optimised voice models (OpenAI Realtime, Cartesia Sonic, ElevenLabs Turbo) closed the round-trip latency gap. By 2026, voice agents handle inbound qualification, outbound prospecting, customer support, scheduling, and verification calls at meaningful production scale. This page consolidates production benchmarks.
Key Findings
- Frontier voice agent platforms (Vapi, Retell AI, Synthflow, Bland AI) deliver end-to-end latency of approximately 500-900ms in 2026, down from 1.5-2.5 seconds in 2023.
- Word error rate (WER) on production calls is approximately 4-8 percent on clear-line calls, materially worse on noisy lines, accents, or technical vocabulary.
- Voice-agent conversion rates on outbound qualification calls run at 15-40 percent of the human baseline; inbound qualification is much closer to parity with human reps.
- Hold rates (the caller stays on the line through the agent's introduction) are approximately 35-55 percent for outbound calls, materially worse than human reps; inbound hold rates are comparable to human.
- Underlying provider stack matters: latency, voice quality, and barge-in handling differ meaningfully across OpenAI Realtime, Cartesia, ElevenLabs, and Deepgram.
End-to-End Latency Benchmarks
| Platform | Median latency | P95 latency | Underlying stack |
|---|---|---|---|
| Vapi | ~500-700ms | ~1.2s | Configurable; commonly Cartesia + Deepgram + GPT-4o |
| Retell AI | ~600-800ms | ~1.4s | Configurable; Retell-tuned |
| Synthflow | ~700-900ms | ~1.5s | Configurable |
| Bland AI | ~500-700ms | ~1.3s | Custom voice model |
| OpenAI Realtime API | ~400-600ms | ~1.0s | End-to-end OpenAI stack |
| Direct Cartesia + Deepgram custom | ~400-600ms | ~0.9s | Custom integration |
Below an 800ms median the conversation feels human; above 1 second, pauses become noticeably awkward. Latency optimisation remains the single dominant production focus in voice AI.
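The end-to-end figures above decompose into per-stage latencies: streaming STT finalisation, LLM time-to-first-token, TTS time-to-first-audio, plus telephony transport. A minimal sketch of a latency budget check follows; all component numbers are illustrative assumptions, not measured figures.

```python
# Illustrative latency budget for one voice-agent conversational turn.
# Every number here is an assumption for illustration only.
BUDGET_MS = 800  # target median end-to-end latency

pipeline = {
    "stt_final_transcript": 150,   # streaming STT endpoint latency
    "llm_first_token": 250,        # LLM time-to-first-token
    "tts_first_audio": 120,        # TTS time-to-first-audio
    "network_and_telephony": 150,  # PSTN/SIP transport overhead
}

def total_latency(components: dict) -> int:
    """Sum per-stage latencies for a single turn."""
    return sum(components.values())

def within_budget(components: dict, budget_ms: int = BUDGET_MS) -> bool:
    """True when the turn fits inside the latency budget."""
    return total_latency(components) <= budget_ms

if __name__ == "__main__":
    total = total_latency(pipeline)
    status = "OK" if within_budget(pipeline) else "OVER"
    print(f"turn latency: {total}ms (budget {BUDGET_MS}ms, {status})")
```

With these illustrative components the turn totals 670ms, inside the 800ms budget; swapping in a slower non-streaming STT stage is typically what pushes a stack past the 1-second mark.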
Voice Quality Comparisons
| TTS Provider | Latency to first audio (ms) | Voice quality (subjective) | Cloning support |
|---|---|---|---|
| OpenAI TTS (Realtime) | ~150-250 | Very high | No |
| Cartesia Sonic | ~80-120 | High | Yes (with consent) |
| ElevenLabs Turbo / Flash | ~150-250 | Very high | Yes |
| Deepgram Aura | ~100-200 | High | Limited |
| Google Cloud TTS | ~200-400 | Medium-high | Limited |
| Open-source (Coqui, etc.) | Variable | Medium | Yes |
STT (Transcription) Quality on Live Calls
| STT Provider | WER on clean calls | WER on noisy calls | Streaming latency |
|---|---|---|---|
| Deepgram Nova-3 | ~3.5% | ~9% | ~150ms |
| OpenAI Whisper (real-time wrapper) | ~4% | ~10% | ~250ms |
| AssemblyAI Universal-2 | ~3.8% | ~9.5% | ~200ms |
| Google Speech-to-Text v2 | ~5% | ~12% | ~250ms |
| Microsoft Azure Speech | ~5% | ~12% | ~250ms |
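The WER figures above follow the standard definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of the computation, useful for spot-checking a provider's claims against your own call transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance:
    (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("please verify my account number", "please verify my count number")` is 0.2: one substitution against five reference words, i.e. the 4-8 percent range above corresponds to roughly one wrong word every 12-25 words spoken.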
Production Performance Metrics
| Metric | Outbound calling | Inbound qualification | Customer support |
|---|---|---|---|
| Hold rate (caller stays past intro) | ~35-55% | ~85-92% | ~92-97% |
| Task-completion rate | ~22-38% | ~62-74% | ~58-72% |
| Conversion vs human baseline | ~15-40% | ~70-95% | ~75-90% |
| CSAT vs human baseline | ~85-95% | ~92-100% | ~88-96% |
| Cost per call | ~$0.20-0.80 | ~$0.30-1.00 | ~$0.50-1.50 |
Outbound is the hardest use case (the caller has not opted in, and tolerance for AI is lower). Inbound and support, where the caller has chosen to engage, perform much closer to the human baseline.
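The table rows combine into a single comparable figure: cost per completed task, i.e. cost per call divided by task-completion rate. A minimal sketch using midpoints of the ranges above; the exact midpoint values are illustrative assumptions, not measured data.

```python
# Cost per completed task from illustrative midpoints of the table above.
channels = {
    "outbound": {"cost_per_call": 0.50, "completion_rate": 0.30},
    "inbound":  {"cost_per_call": 0.65, "completion_rate": 0.68},
    "support":  {"cost_per_call": 1.00, "completion_rate": 0.65},
}

def cost_per_completed_task(cost_per_call: float, completion_rate: float) -> float:
    """Effective spend to obtain one successfully completed call outcome."""
    return cost_per_call / completion_rate

if __name__ == "__main__":
    for name, c in channels.items():
        print(f"{name}: ${cost_per_completed_task(**c):.2f} per completed task")
```

On these assumptions, outbound's low completion rate roughly triples its effective cost per outcome relative to its per-call price, which is why outbound economics hinge on completion rate at least as much as on per-minute pricing.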
Production Pitfalls
- Barge-in handling: the caller interrupting the agent mid-sentence; how cleanly playback is cancelled and the turn handed over varies by platform
- Background noise: WER degrades 2-3x in noisy environments; explicit noise suppression is increasingly standard
- Accent and dialect: WER 1.5-2x higher on non-standard accents; agent training data skew matters
- Phone-line audio quality: PSTN compression and codec artefacts degrade STT quality
- Disclosure compliance: many jurisdictions require AI disclosure to callers; compliance-aware platforms enforce this
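Barge-in handling is typically implemented as a small state machine driven by voice-activity-detection (VAD) frames: while the agent is speaking, a short run of consecutive voiced frames from the caller cancels playback and hands over the turn. The sketch below is a hypothetical minimal version of that logic, not any specific platform's implementation; the frame interface and threshold are assumptions.

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInController:
    """Minimal barge-in state machine: cancel agent playback once
    sustained caller speech is detected while the agent is talking."""

    def __init__(self, min_speech_frames: int = 3):
        # Require several consecutive voiced frames so coughs or
        # line noise do not cut the agent off.
        self.min_speech_frames = min_speech_frames
        self.state = AgentState.SPEAKING
        self._voiced_run = 0

    def on_vad_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision per audio frame; returns True the moment
        a barge-in fires (i.e. TTS playback should be cancelled)."""
        if self.state is not AgentState.SPEAKING:
            return False
        self._voiced_run = self._voiced_run + 1 if is_speech else 0
        if self._voiced_run >= self.min_speech_frames:
            self.state = AgentState.LISTENING  # hand the turn to the caller
            return True
        return False
```

The consecutive-frame threshold is the key tuning knob: too low and background noise triggers spurious interruptions, too high and the agent talks over the caller, which is exactly the awkwardness the hold-rate numbers above penalise.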
Brand Visibility Implications
Voice AI agents are an emerging brand-recommendation surface. When a voice agent answers product questions during a sales call, its recommendations shape buyer perceptions before any human rep is involved, and recommendation quality is directly tied to the underlying LLM and the prompt structure built into the agent. Brands relevant to voice-agent decision contexts (sales tools, contact-centre platforms, voice infrastructure providers) therefore have material AI-mediated visibility at stake on this surface.
Methodology
Latency figures are drawn from Vapi documentation and from Retell AI, Synthflow, and Bland AI public benchmarks. STT WER figures come from Deepgram benchmarks and AssemblyAI research. Production performance metrics come from public case studies and Presenc AI deployment instrumentation across 25+ enterprise voice-agent deployments. Updated quarterly.
How Presenc AI Helps
Presenc AI's voice-agent observability captures brand-mention rates inside agent-customer conversations, surfacing how often products and competitors are recommended during voice calls. For brands operating in voice-agent-mediated buyer flows, this is the operational signal of brand exposure on a fast-growing AI surface.