Research

AI Hallucination Rate Benchmarks 2026

Public benchmark data for hallucination rates across major LLMs in 2026: Vectara HHEM, HaluEval, RAGTruth, and FACTS Grounding. By model, by task, with what the numbers actually mean.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

How Often Do AI Models Actually Hallucinate?

Hallucination rate is the most-cited and least-well-understood AI quality metric. Different benchmarks measure different phenomena: groundedness on summarisation, faithfulness in RAG, factuality on closed-book queries, citation accuracy. This page consolidates 2026 results across the four most-cited public benchmarks and explains what each number actually means.

Key Findings

  1. On Vectara's Hughes Hallucination Evaluation Model (HHEM) summarisation benchmark, frontier models in 2026 hallucinate on approximately 1.0-2.5 percent of summaries, down from 3-8 percent in 2023.
  2. On RAG-faithfulness benchmarks (RAGTruth), frontier-model hallucination rates are 4-9 percent, materially higher than summarisation, because RAG must integrate external context with internal knowledge.
  3. On closed-book factuality benchmarks (TruthfulQA, FACTS Grounding), frontier models score 80-90 percent accuracy in 2026, up from 50-65 percent in 2023; long-tail facts remain the dominant failure mode.
  4. Hallucination rates vary 5-15x by topic; long-tail facts (obscure historical, technical, or local information) hallucinate at 15-40 percent even on frontier models, while head-of-distribution facts hallucinate at 1-3 percent.
  5. Open-weight models trail frontier closed APIs by 2-5 percentage points on summarisation hallucination, less on RAG faithfulness.

Vectara HHEM Leaderboard (Summarisation Hallucination, May 2026)

| Model | Hallucination rate | Factual consistency |
| --- | --- | --- |
| GPT-5 Pro | ~1.0% | ~99.0% |
| Claude Opus 4.7 | ~1.2% | ~98.8% |
| Claude Sonnet 4.6 | ~1.5% | ~98.5% |
| Gemini 2.5 Pro | ~1.4% | ~98.6% |
| GPT-5 | ~1.6% | ~98.4% |
| Llama 4 405B | ~2.3% | ~97.7% |
| Qwen 3 235B | ~2.5% | ~97.5% |
| Llama 4 70B | ~3.0% | ~97.0% |
| DeepSeek V4 | ~3.4% | ~96.6% |
| Mistral Large 2 | ~3.8% | ~96.2% |

HHEM measures whether a summary contains content not supported by the source document. It is a reliable proxy for groundedness in summarisation but does not measure factual accuracy of closed-book queries.
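To make the metric concrete, here is a minimal sketch of how a summarisation hallucination rate like HHEM's is computed: judge each (source, summary) pair as grounded or not, then take the fraction judged ungrounded. The `judge_grounded` stub below is a toy word-containment check standing in for a real entailment model; it is illustrative only, not Vectara's actual pipeline.

```python
def judge_grounded(source: str, summary: str) -> bool:
    """Toy judge: a summary is grounded only if every word appears in the
    source. A real evaluator uses a trained entailment/grounding model."""
    source_words = set(source.lower().split())
    return all(w in source_words for w in summary.lower().split())

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (source, summary) pairs judged ungrounded."""
    ungrounded = sum(1 for src, summ in pairs if not judge_grounded(src, summ))
    return ungrounded / len(pairs)

pairs = [
    ("the cat sat on the mat", "the cat sat"),            # grounded
    ("the cat sat on the mat", "the dog barked loudly"),  # hallucinated
]
rate = hallucination_rate(pairs)
print(f"hallucination rate: {rate:.1%}, factual consistency: {1 - rate:.1%}")
```

Note that "factual consistency" in the leaderboard table is simply the complement of the hallucination rate.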

RAG Faithfulness Benchmarks (RAGTruth-style, May 2026)

| Model | Faithfulness error rate | Notes |
| --- | --- | --- |
| Claude Opus 4.7 | ~4.2% | Strong context adherence |
| GPT-5 Pro | ~4.6% | Strong context adherence |
| Claude Sonnet 4.6 | ~5.5% | |
| Gemini 2.5 Pro | ~5.8% | |
| Llama 4 405B | ~7.2% | |
| Qwen 3 32B | ~8.4% | |
| Llama 4 70B | ~9.1% | |

RAG faithfulness measures whether responses are grounded in the retrieved context. Error rates run higher than summarisation hallucination because RAG systems must integrate retrieved documents with parametric knowledge, which creates more opportunities for unsupported assertions.
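A RAGTruth-style faithfulness check can be sketched as: split the response into claims (here, naively, sentences) and flag any claim not supported by at least one retrieved chunk. The word-overlap `supported` test below is a toy stand-in for the NLI or LLM-judge scoring real benchmarks use, and the 0.7 threshold is an arbitrary illustrative choice.

```python
def supported(sentence: str, chunks: list[str], threshold: float = 0.7) -> bool:
    """Toy support test: enough of the sentence's content words (>3 chars)
    must appear in some retrieved chunk."""
    words = [w for w in sentence.lower().split() if len(w) > 3]
    if not words:
        return True
    best = max(sum(w in chunk.lower() for w in words) / len(words)
               for chunk in chunks)
    return best >= threshold

def faithfulness_error_rate(response: str, chunks: list[str]) -> float:
    """Fraction of response sentences unsupported by the retrieved context."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    unsupported = [s for s in sentences if not supported(s, chunks)]
    return len(unsupported) / len(sentences)

chunks = ["The product launched in 2019 and supports single sign-on."]
response = "The product supports single sign-on. The product won an award in 2021."
print(faithfulness_error_rate(response, chunks))  # second sentence is unsupported
```

The second sentence is flagged because "award" and "2021" appear nowhere in the retrieved chunk, which is exactly the unsupported-assertion failure mode described above.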

Closed-Book Factuality (FACTS Grounding, TruthfulQA, May 2026)

| Model | FACTS Grounding % | TruthfulQA % |
| --- | --- | --- |
| GPT-5 Pro | ~88 | ~89 |
| Claude Opus 4.7 | ~87 | ~91 |
| Gemini 2.5 Pro | ~85 | ~85 |
| Llama 4 70B | ~76 | ~78 |

FACTS Grounding (DeepMind) and TruthfulQA measure factual accuracy on test-set questions. Both are subject to training-data contamination concerns; treat absolute numbers cautiously.

Hallucination Rate by Topic Distribution

| Topic class | Frontier-model hallucination rate |
| --- | --- |
| High-frequency factual (top global cities, major historical events) | ~1-3% |
| Mid-frequency factual (mid-tier company facts, mid-tier historical) | ~5-12% |
| Long-tail factual (obscure historical, technical, local) | ~15-40% |
| Recent events (post-training-cutoff) | ~30-60% |
| Bibliographic (author names, paper titles, dates) | ~10-25% |
| Mathematical / numerical | ~3-8% on common; 15-30% on novel |
| Code (function signatures, library APIs) | ~5-15% on stable APIs; higher on changing libs |

What Reduces Hallucination Rate

  • RAG with good retrieval: 50-80 percent hallucination reduction on factual queries when retrieval is high-quality
  • Function calling for facts: querying authoritative APIs for facts (calendar, calculation, search) avoids hallucination at the source
  • Confidence-calibrated abstention: models trained to say "I don't know" reduce false-confident hallucinations 2-5x
  • Citation-required output: models forced to cite reduce unsupported claims 30-60 percent
  • Reasoning mode: extended thinking reduces hallucination 15-30 percent on factual queries by surfacing self-doubts the model would otherwise paper over
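Two of the mitigations above, citation-required output and confidence-calibrated abstention, can be applied at the prompt layer without retraining. The sketch below shows one way to do that; `call_model` and `judge_confidence` are hypothetical stand-ins for any chat-completion API and any confidence scorer, and the prompt wording is illustrative rather than a tested recipe.

```python
# Hypothetical system prompt enforcing citation-required output.
CITATION_SYSTEM_PROMPT = (
    "Answer only using the provided sources. After every factual claim, cite "
    "the supporting source as [n]. If no source supports a claim, omit it. "
    "If the sources do not contain the answer, reply exactly: I don't know."
)

def answer_with_citations(question: str, sources: list[str], call_model) -> str:
    """Citation-required output: number the sources and demand [n] citations."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    prompt = f"Sources:\n{numbered}\n\nQuestion: {question}"
    return call_model(system=CITATION_SYSTEM_PROMPT, user=prompt)

def gated_answer(question: str, call_model, judge_confidence,
                 floor: float = 0.8) -> str:
    """Confidence-calibrated abstention: suppress low-confidence drafts."""
    draft = call_model(system="Answer concisely.", user=question)
    # Abstain rather than emit a likely-hallucinated low-confidence answer.
    return draft if judge_confidence(question, draft) >= floor else "I don't know."
```

The abstention floor trades coverage for precision: raising it cuts false-confident hallucinations at the cost of more "I don't know" responses.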

Brand Visibility Implications

Hallucination rates are directly relevant to brand visibility: a 1.5 percent summarisation hallucination rate means roughly 15 in every 1,000 brand-mentioning responses contain fabricated content. For brands targeting AI-mediated discovery, hallucination is both a brand-safety concern (false claims attributed to the brand) and a visibility concern (responses that credit competitors when they should mention you). Brands should monitor hallucination rates inside their target AI surfaces; the difference between 1 percent and 5 percent hallucination affects tens of thousands of impressions per million queries.
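The back-of-envelope arithmetic behind those figures is straightforward: multiply the hallucination rate by response volume. Pure arithmetic, no external data assumed.

```python
def affected_responses(hallucination_rate: float, responses: int) -> int:
    """Expected number of responses containing fabricated content."""
    return round(hallucination_rate * responses)

print(affected_responses(0.015, 1_000))  # 15 affected per 1,000 responses
# Gap between a 5% and a 1% rate at one million queries:
print(affected_responses(0.05, 1_000_000) - affected_responses(0.01, 1_000_000))
# 40,000 additional affected impressions
```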

Methodology

HHEM scores from the Vectara hallucination leaderboard, RAGTruth from the RAGTruth repository, FACTS Grounding from DeepMind's benchmark, TruthfulQA from Lin et al. 2021. Topic-distribution figures triangulated from multiple academic studies and Presenc AI's evaluation infrastructure across enterprise customers. Benchmarks are subject to contamination concerns; treat relative ranking as more reliable than absolute numbers. Updated quarterly.

How Presenc AI Helps

Presenc AI tracks brand-specific hallucination rates inside AI assistant responses, distinguishing brand-fact-correct from brand-fact-fabricated answers across major AI platforms. For brands concerned with AI-mediated misinformation about their company, products, or executives, this is the operational signal that connects benchmark hallucination data to real brand exposure.

Frequently Asked Questions

How often do AI models actually hallucinate?

On summarisation benchmarks, frontier models (GPT-5, Claude Opus 4.7, Gemini 2.5 Pro) hallucinate on approximately 1.0-2.5 percent of outputs. RAG-faithfulness errors are higher at 4-9 percent. Long-tail factual queries hallucinate at 15-40 percent even on frontier models. Hallucination is task-dependent; one number does not capture the phenomenon.

Which model hallucinates the least?

On Vectara HHEM summarisation, GPT-5 Pro at approximately 1.0 percent and Claude Opus 4.7 at approximately 1.2 percent are the leaders in May 2026. Differences below 0.5 percent are within benchmark noise; treat the top 3-4 models as roughly tied.

Is hallucination getting better over time?

Better on common benchmarks. HHEM frontier-model hallucination dropped from 3-8 percent (2023) to 1.0-2.5 percent (2026). On long-tail facts and post-cutoff events, hallucination remains high: the trend is rapid improvement on benchmark-style queries and slower improvement on the long tail.

What reduces hallucination most effectively?

In order of effectiveness: (1) RAG with high-quality retrieval (50-80% reduction on factual queries), (2) function calling to authoritative APIs for facts (eliminates source-of-truth hallucination), (3) citation-required output (30-60% reduction in unsupported claims), (4) reasoning mode (15-30% reduction on factual queries), (5) confidence-calibrated abstention training.

Can benchmark hallucination numbers be trusted?

Directionally, yes; in absolute terms, only with caveats. Benchmarks have known training-data contamination concerns, evaluator-agreement issues, and topic-coverage gaps. Treat relative rankings as reliable and absolute percentages with appropriate uncertainty bands; cross-reference multiple benchmarks rather than relying on any single number.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.