How Often Do AI Models Actually Hallucinate?
Hallucination rate is the most cited and least well understood AI quality metric. Different benchmarks measure different phenomena: groundedness in summarisation, faithfulness in RAG, factuality on closed-book queries, citation accuracy. This page consolidates 2026 results across the four most-cited public benchmarks and explains what each number actually means.
Key Findings
- On Vectara's Hughes Hallucination Evaluation Model (HHEM) summarisation benchmark, frontier models in 2026 hallucinate on approximately 1.0-2.5 percent of summaries, down from 3-8 percent in 2023.
- On RAG-faithfulness benchmarks (RAGTruth), frontier-model hallucination rates are 4-9 percent, materially higher than on summarisation, because RAG requires models to integrate external context with parametric knowledge.
- On closed-book factuality benchmarks (TruthfulQA, FACTS Grounding), frontier models score 80-90 percent accuracy in 2026, up from 50-65 percent in 2023; long-tail facts remain the dominant failure mode.
- Hallucination rates vary 5-15x by topic; long-tail facts (obscure historical, technical, or local information) hallucinate at 15-40 percent even on frontier models, while head-of-distribution facts hallucinate at 1-3 percent.
- Open-weight models trail frontier closed APIs by roughly 1-2 percentage points on summarisation hallucination and by roughly 3 points on RAG faithfulness, where the gap is wider.
Vectara HHEM Leaderboard (Summarisation Hallucination, May 2026)
| Model | Hallucination rate | Factual consistency |
|---|---|---|
| GPT-5 Pro | ~1.0% | ~99.0% |
| Claude Opus 4.7 | ~1.2% | ~98.8% |
| Claude Sonnet 4.6 | ~1.5% | ~98.5% |
| Gemini 2.5 Pro | ~1.4% | ~98.6% |
| GPT-5 | ~1.6% | ~98.4% |
| Llama 4 405B | ~2.3% | ~97.7% |
| Qwen 3 235B | ~2.5% | ~97.5% |
| Llama 4 70B | ~3.0% | ~97.0% |
| DeepSeek V4 | ~3.4% | ~96.6% |
| Mistral Large 2 | ~3.8% | ~96.2% |
HHEM measures whether a summary contains content unsupported by the source document. It is a reliable proxy for groundedness in summarisation but does not measure factual accuracy on closed-book queries.
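As a concrete illustration, the sketch below scores (source, summary) pairs with a cross-encoder the way HHEM-style evaluation works. The model id is Vectara's published checkpoint, but the loading API, the 0.5 threshold, and the example data are assumptions; newer HHEM releases use a different interface, so check the model card before relying on this.

```python
from sentence_transformers import CrossEncoder

# HHEM-style groundedness scoring: each pair is (source document, summary).
# Model id is Vectara's published checkpoint; loading via CrossEncoder
# is an assumption about the release you have installed.
model = CrossEncoder("vectara/hallucination_evaluation_model")

dataset = [
    ("The meeting was moved to Tuesday at 10am.",
     "The meeting is on Tuesday at 10am."),
    ("The meeting was moved to Tuesday at 10am.",
     "The meeting is on Wednesday."),
]

# Scores fall in [0, 1]; higher means the summary is more consistent
# with its source document.
scores = model.predict(dataset)

THRESHOLD = 0.5  # assumed cut-off between "consistent" and "hallucinated"
hallucination_rate = sum(s < THRESHOLD for s in scores) / len(scores)
print(f"hallucination rate: {hallucination_rate:.1%}")
```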
RAG Faithfulness Benchmarks (RAGTruth-style, May 2026)
| Model | Faithfulness error rate | Notes |
|---|---|---|
| Claude Opus 4.7 | ~4.2% | Strong context adherence |
| GPT-5 Pro | ~4.6% | Strong context adherence |
| Claude Sonnet 4.6 | ~5.5% | |
| Gemini 2.5 Pro | ~5.8% | |
| Llama 4 405B | ~7.2% | |
| Qwen 3 32B | ~8.4% | |
| Llama 4 70B | ~9.1% | |
RAG faithfulness measures whether responses are grounded in the retrieved context. Rates run higher than summarisation hallucination because RAG models must integrate retrieved documents with parametric knowledge, creating more opportunities for unsupported assertions.
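A common way to operationalise this, sketched below, is claim-level entailment checking: split the answer into sentences and flag any sentence not entailed by at least one retrieved passage. The NLI model id, its label order, and the example data are assumptions, not part of any specific benchmark's pipeline.

```python
from sentence_transformers import CrossEncoder

# NLI model used as a faithfulness judge; label order assumed to be
# (contradiction, entailment, neutral) -- verify against the model card.
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
ENTAILMENT = 1

def unsupported_claims(answer_sentences, retrieved_passages):
    """Return answer sentences not entailed by any retrieved passage."""
    flagged = []
    for sentence in answer_sentences:
        pairs = [(passage, sentence) for passage in retrieved_passages]
        logits = nli.predict(pairs)  # shape: (n_passages, 3)
        if not any(row.argmax() == ENTAILMENT for row in logits):
            flagged.append(sentence)
    return flagged

passages = ["Acme was founded in 2009 in Austin, Texas."]
answer = ["Acme was founded in 2009.", "Acme employs 5,000 people."]
print(unsupported_claims(answer, passages))
# -> ['Acme employs 5,000 people.']
```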
Closed-Book Factuality (FACTS Grounding, TruthfulQA, May 2026)
| Model | FACTS Grounding % | TruthfulQA % |
|---|---|---|
| GPT-5 Pro | ~88 | ~89 |
| Claude Opus 4.7 | ~87 | ~91 |
| Gemini 2.5 Pro | ~85 | ~85 |
| Llama 4 70B | ~76 | ~78 |
TruthfulQA probes closed-book factual accuracy against common misconceptions, while FACTS Grounding (DeepMind) scores whether long-form responses stay grounded in supplied documents. Both are subject to training-data contamination concerns; treat absolute numbers cautiously.
Hallucination Rate by Topic Distribution
| Topic class | Frontier-model hallucination rate |
|---|---|
| High-frequency factual (top global cities, major historical events) | ~1-3% |
| Mid-frequency factual (mid-tier company facts, mid-tier historical) | ~5-12% |
| Long-tail factual (obscure historical, technical, local) | ~15-40% |
| Recent events (post-training-cutoff) | ~30-60% |
| Bibliographic (author names, paper titles, dates) | ~10-25% |
| Mathematical / numerical | ~3-8% on common problems; ~15-30% on novel problems |
| Code (function signatures, library APIs) | ~5-15% on stable APIs; higher on fast-changing libraries |
What Reduces Hallucination Rate
- RAG with good retrieval: 50-80 percent hallucination reduction on factual queries when retrieval is high-quality
- Function calling for facts: querying authoritative APIs for facts (calendar, calculation, search) avoids hallucination at the source
- Confidence-calibrated abstention: models trained to say "I don't know" reduce false-confident hallucinations 2-5x
- Citation-required output: models forced to cite reduce unsupported claims 30-60 percent (see the validation sketch after this list)
- Reasoning mode: extended thinking reduces hallucination 15-30 percent on factual queries by surfacing self-doubts the model would otherwise paper over
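The sketch below shows the simplest layer of citation-required validation: every sentence must carry at least one citation, and every cited id must map to a passage that was actually retrieved. The `[n]` response format and the helper name are hypothetical; an entailment check like the RAG sketch above would be the next layer.

```python
import re

def validate_citations(response: str, passages: dict[int, str]) -> list[str]:
    """Flag sentences that cite nothing, or that cite a passage id
    which was never retrieved."""
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        cited = [int(m) for m in re.findall(r"\[(\d+)\]", sentence)]
        if not cited:
            problems.append(f"uncited claim: {sentence!r}")
        elif any(i not in passages for i in cited):
            problems.append(f"dangling citation: {sentence!r}")
    return problems

passages = {1: "Acme was founded in 2009.", 2: "Acme is based in Austin."}
response = "Acme was founded in 2009 [1]. It employs 5,000 people [3]."
print(validate_citations(response, passages))
# -> ["dangling citation: 'It employs 5,000 people [3].'"]
```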
Brand Visibility Implications
Hallucination rates are directly relevant to brand visibility: a 1.5 percent summarisation hallucination rate translates to roughly 15 in 1000 brand-mentioning responses containing fabricated content. For brands targeting AI-mediated discovery, hallucination is both a brand-safety concern (false claims attributed to the brand) and a visibility concern (responses crediting competitors with claims that should have mentioned you). Brands should monitor hallucination rates inside their target AI surfaces; the difference between 1 percent and 5 percent hallucination affects tens of thousands of impressions per million queries.
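The arithmetic is straightforward; the short worked example below makes it concrete. The volume figure is purely illustrative.

```python
def affected_responses(rate: float, volume: int) -> float:
    """Expected responses containing fabricated content at a given
    hallucination rate and brand-mentioning response volume."""
    return rate * volume

volume = 1_000_000  # illustrative: brand-mentioning responses per period
for rate in (0.01, 0.015, 0.05):
    print(f"{rate:.1%} -> {affected_responses(rate, volume):,.0f} affected")
# 1.0% -> 10,000; 1.5% -> 15,000; 5.0% -> 50,000
```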
Methodology
HHEM scores are from the Vectara hallucination leaderboard, RAG-faithfulness figures from the RAGTruth repository, FACTS Grounding from DeepMind's benchmark, and TruthfulQA from Lin et al. 2021. Topic-distribution figures are triangulated from multiple academic studies and from Presenc AI's evaluation infrastructure across enterprise customers. All benchmarks are subject to contamination concerns; treat relative rankings as more reliable than absolute numbers. Updated quarterly.
How Presenc AI Helps
Presenc AI tracks brand-specific hallucination rates inside AI assistant responses, distinguishing brand-fact-correct from brand-fact-fabricated answers across major AI platforms. For brands concerned with AI-mediated misinformation about their company, products, or executives, this is the operational signal that connects benchmark hallucination data to real brand exposure.