Voice AI Call Agent Benchmarks 2026

Production benchmarks for voice AI call agents in 2026: latency, word error rate, hold rates, conversion. Vapi, Synthflow, Retell AI, Bland AI, plus underlying providers (OpenAI Realtime, Cartesia, ElevenLabs, Deepgram).

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

The Voice AI Production Stack in 2026

Voice AI call agents moved from demo-quality to production-deployable in 2024-2025 as latency-optimised voice models (OpenAI Realtime, Cartesia Sonic, ElevenLabs Turbo) closed the round-trip latency gap. By 2026, voice agents handle inbound qualification, outbound prospecting, customer support, scheduling, and verification calls at meaningful production scale. This page consolidates production benchmarks.

Key Findings

  1. Frontier voice agent platforms (Vapi, Retell AI, Synthflow, Bland AI) deliver end-to-end latency of approximately 500-900ms in 2026, down from 1.5-2.5 seconds in 2023.
  2. Word error rate (WER) on production calls is approximately 4-8 percent on clear-line calls, materially worse on noisy lines, accents, or technical vocabulary.
  3. Voice-agent conversion rates on outbound qualification calls are 15-40 percent of the human baseline; inbound qualification runs much closer to parity with human reps.
  4. Hold rates (caller stays on the line through the agent's introduction) are approximately 35-55 percent for outbound calls, materially worse than human reps; inbound hold rates are comparable to human performance.
  5. Underlying provider stack matters: latency, voice quality, and barge-in handling differ meaningfully across OpenAI Realtime, Cartesia, ElevenLabs, and Deepgram.

End-to-End Latency Benchmarks

| Platform | Median latency | P95 latency | Underlying stack |
| --- | --- | --- | --- |
| Vapi | ~500-700ms | ~1.2s | Configurable; commonly Cartesia + Deepgram + GPT-4o |
| Retell AI | ~600-800ms | ~1.4s | Configurable; Retell-tuned |
| Synthflow | ~700-900ms | ~1.5s | Configurable |
| Bland AI | ~500-700ms | ~1.3s | Custom voice model |
| OpenAI Realtime API | ~400-600ms | ~1.0s | End-to-end OpenAI stack |
| Direct Cartesia + Deepgram (custom) | ~400-600ms | ~0.9s | Custom integration |

Below 800ms median latency the conversation feels human; above 1 second the awkwardness becomes noticeable. Latency optimisation is the single dominant production focus in voice AI.
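To see how a turn's latency budget decomposes, here is a minimal sketch. The component values are illustrative midpoints in the ranges cited on this page, not measurements from any specific provider:

```python
# Hypothetical latency budget for one conversational turn (values are
# illustrative medians, not measurements from any single provider).
budget_ms = {
    "stt_final_transcript": 150,   # streaming STT finalisation
    "llm_first_token": 300,        # LLM time-to-first-token
    "tts_first_audio": 120,        # TTS time-to-first-audio
    "network_and_telephony": 100,  # PSTN/WebRTC transport overhead
}

total = sum(budget_ms.values())
print(f"median turn latency: {total}ms")  # 670ms
print("feels human" if total < 800 else "noticeably awkward")
```

Swapping any single component for a slower one (e.g. a non-streaming STT pass at ~500ms) pushes the total past the 800ms threshold, which is why every layer of the stack is latency-optimised independently.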

Voice Quality Comparisons

| TTS provider | Latency to first audio (ms) | Voice quality (subjective) | Cloning support |
| --- | --- | --- | --- |
| OpenAI TTS (Realtime) | ~150-250 | Very high | No |
| Cartesia Sonic | ~80-120 | High | Yes (with consent) |
| ElevenLabs Turbo / Flash | ~150-250 | Very high | Yes |
| Deepgram Aura | ~100-200 | High | Limited |
| Google Cloud TTS | ~200-400 | Medium-high | Limited |
| Open-source (Coqui, etc.) | Variable | Medium | Yes |

STT (Transcription) Quality on Live Calls

| STT provider | WER (clean calls) | WER (noisy calls) | Streaming latency |
| --- | --- | --- | --- |
| Deepgram Nova-3 | ~3.5% | ~9% | ~150ms |
| OpenAI Whisper (real-time wrapper) | ~4% | ~10% | ~250ms |
| AssemblyAI Universal-2 | ~3.8% | ~9.5% | ~200ms |
| Google Speech-to-Text v2 | ~5% | ~12% | ~250ms |
| Microsoft Azure Speech | ~5% | ~12% | ~250ms |
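Word error rate, as used in the table above, is the standard metric: (substitutions + insertions + deletions) divided by the number of reference words, computed via word-level edit distance. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + inserts + deletes) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitution errors ("hold" -> "old", "your" -> "you") over 7 words.
print(wer("please hold while i transfer your call",
          "please old while i transfer you call"))  # ~0.286, i.e. ~28.6% WER
```

Note that a 4 percent WER means roughly one word in 25 is wrong, which is why technical vocabulary (product names, account numbers) is disproportionately affected on production calls.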

Production Performance Metrics

| Metric | Outbound calling | Inbound qualification | Customer support |
| --- | --- | --- | --- |
| Hold rate (caller stays past intro) | ~35-55% | ~85-92% | ~92-97% |
| Task-completion rate | ~22-38% | ~62-74% | ~58-72% |
| Conversion vs human baseline | ~15-40% | ~70-95% | ~75-90% |
| CSAT vs human baseline | ~85-95% | ~92-100% | ~88-96% |
| Cost per call | ~$0.20-0.80 | ~$0.30-1.00 | ~$0.50-1.50 |

Outbound is the hardest use case: the caller has not opted in, and tolerance for AI cold calls is lower. Inbound and support, where the caller has chosen to engage, perform much closer to the human baseline.

Production Pitfalls

  • Barge-in handling: detecting when the caller interrupts the agent mid-sentence and cutting off playback promptly; quality varies significantly by platform
  • Background noise: WER degrades 2-3x in noisy environments; explicit noise suppression is increasingly standard
  • Accent and dialect: WER 1.5-2x higher on non-standard accents; agent training data skew matters
  • Phone-line audio quality: PSTN compression and codec artefacts degrade STT quality
  • Disclosure compliance: many jurisdictions require AI disclosure to callers; compliance-aware platforms enforce this
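Barge-in handling, the first pitfall above, reduces to a small state machine: while the agent's TTS is playing, watch the caller's audio for sustained speech and cancel playback when it appears. The sketch below uses a hypothetical interface (no platform's actual API); the debounce threshold prevents a cough or line noise from cutting the agent off:

```python
class BargeInHandler:
    """Sketch of barge-in detection: stop agent TTS when the caller
    speaks over it for a sustained interval (hypothetical interface)."""

    def __init__(self, vad_threshold_ms: int = 200):
        self.vad_threshold_ms = vad_threshold_ms  # debounce before interrupting
        self.agent_speaking = False
        self._caller_speech_ms = 0
        self.events = []

    def on_agent_audio_start(self):
        self.agent_speaking = True
        self._caller_speech_ms = 0

    def on_agent_audio_end(self):
        self.agent_speaking = False

    def on_caller_audio_frame(self, is_speech: bool, frame_ms: int = 20):
        # Called per 20ms caller audio frame, with a VAD speech/non-speech flag.
        if not self.agent_speaking:
            return
        self._caller_speech_ms = self._caller_speech_ms + frame_ms if is_speech else 0
        if self._caller_speech_ms >= self.vad_threshold_ms:
            self.agent_speaking = False
            self.events.append("stop_tts")  # cancel playback, yield the turn

h = BargeInHandler()
h.on_agent_audio_start()
for _ in range(10):              # 200ms of continuous caller speech
    h.on_caller_audio_frame(True)
print(h.events)  # ['stop_tts']
```

The threshold is the key tuning knob: too low and background noise interrupts the agent constantly; too high and the caller experiences the agent "talking over" them.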

Brand Visibility Implications

Voice AI agents are an emerging brand-recommendation surface. When a voice agent answers product questions during a sales call, the agent's recommendations shape buyer perceptions before any human rep involvement. Voice agent recommendation quality is directly tied to underlying LLM quality and the prompt structure built into the agent. Brands relevant to voice-agent decision contexts (sales tools, contact centre platforms, voice infrastructure providers) face material AI-mediated visibility on this surface.

Methodology

Latency from Vapi documentation, Retell AI, Synthflow, Bland AI public benchmarks. STT WER from Deepgram benchmarks and AssemblyAI research. Production performance from public case studies and Presenc AI deployment instrumentation across 25+ voice-agent enterprise deployments. Updated quarterly.

How Presenc AI Helps

Presenc AI's voice-agent observability captures brand-mention rates inside agent-customer conversations, surfacing how often products and competitors are recommended during voice calls. For brands operating in voice-agent-mediated buyer flows, this is the operational signal of brand exposure on a fast-growing AI surface.

Frequently Asked Questions

Which voice AI platform is best in 2026?
It depends on the use case. For low-latency outbound, Vapi and Bland AI lead at 500-700ms median. For inbound and support, Retell AI and Synthflow are competitive. For maximum control, direct integration of OpenAI Realtime + Cartesia + Deepgram is the lowest-latency stack at ~400-600ms. There is no single best.

What latency is acceptable for a voice AI agent?
Below 800ms median end-to-end latency, conversations feel human; above 1 second, awkwardness becomes noticeable. P95 latency under 1.5 seconds is the operational target for production deployments. Latency optimisation is the dominant focus in voice AI engineering in 2026.

Can voice AI agents match human conversion rates?
For inbound qualification and support, yes: ~70-95 percent of the human baseline. For outbound cold calling, materially worse, at ~15-40 percent of the human baseline, because hold rates are lower and consumer tolerance for AI cold calls is limited. Outbound voice AI economics still work despite lower conversion because cost per call is 5-20x lower than human rep cost.
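A back-of-envelope check on the outbound economics, using the ranges quoted on this page for the AI side and assumed (illustrative, not measured) figures for the human side:

```python
# Cost per conversion, AI vs human outbound. Human figures are assumptions
# for illustration; AI figures are midpoints of the ranges cited above.
human_cost_per_call = 6.00        # assumed loaded human rep cost per dial
human_conversion = 0.05           # assumed human outbound conversion rate
ai_cost_per_call = 0.50           # midpoint of ~$0.20-0.80
ai_conversion = human_conversion * 0.275  # ~27.5% of human baseline (mid of 15-40%)

human_cpc = human_cost_per_call / human_conversion
ai_cpc = ai_cost_per_call / ai_conversion
print(f"human cost per conversion: ${human_cpc:.2f}")  # $120.00
print(f"AI cost per conversion:    ${ai_cpc:.2f}")     # ~$36.36
```

Under these assumptions the AI agent converts at roughly a quarter of the human rate but still delivers a conversion at roughly a third of the cost, which is the arithmetic behind "economics work despite lower conversion".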

What components make up a voice agent stack?
Three components: STT (Deepgram, OpenAI Whisper, AssemblyAI), LLM (GPT-5, Claude, Gemini, Llama 4), and TTS (OpenAI, Cartesia, ElevenLabs, Deepgram Aura). Some platforms wrap all three end-to-end (OpenAI Realtime). Stack choice substantially affects latency and voice quality.
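The three-component turn loop can be sketched as follows. The function bodies are placeholder stubs, not real provider SDK calls; a production system streams all three stages concurrently rather than running them sequentially:

```python
def transcribe_stream(audio_chunk: bytes) -> str:
    # Placeholder STT stub: a real system streams to Deepgram/Whisper/etc.
    return "what does the pro plan cost"

def generate_reply(transcript: str, history: list) -> str:
    # Placeholder LLM stub: a real system calls GPT/Claude/Gemini/Llama.
    return f"reply to: {transcript}"

def synthesize(text: str) -> bytes:
    # Placeholder TTS stub: a real system streams audio from the provider.
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes, history: list) -> bytes:
    transcript = transcribe_stream(audio_chunk)  # 1. STT: speech -> text
    reply = generate_reply(transcript, history)  # 2. LLM: text -> reply
    history.extend([transcript, reply])          #    keep conversation context
    return synthesize(reply)                     # 3. TTS: reply -> speech

history = []
audio = handle_turn(b"\x00" * 320, history)      # one 20ms PCM frame, zeroed
print(history)
```

The sequential structure shown here is why latency compounds: each stage's time-to-first-output adds to the turn total, and streaming each stage's output into the next is what brings the end-to-end figure down to the ~500-900ms range.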

Is AI disclosure legally required on calls?
Yes, though requirements vary by jurisdiction. Several US states (California, Florida, Texas) and Canada require AI disclosure to callers. The EU AI Act treats deepfakes and biometric AI specifically. Industry-specific rules apply in healthcare and financial services. Compliance-aware platforms enforce disclosure; ad-hoc deployments often do not.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.