What This Is
Medical AI hallucinations are the most safety-critical failure mode in the LLM literature, and 2026 has brought sharper benchmarks (OpenAI's HealthBench Professional, planted-error clinical vignette studies) and more granular rate estimates. This page is a 2026-05-15 reference snapshot.
Hallucination Rates by Study
| Study / model | Hallucination rate | Notes |
|---|---|---|
| 2025 MedRxiv (300 vignettes, no mitigation, long cases) | 64.1% | All models |
| 2025 MedRxiv (with mitigation prompts) | 43.1-45.3% | ~33% relative reduction |
| GPT-4o (without mitigation) | 53% | Same study |
| GPT-4o (with mitigation) | 23% | Same study, best-mitigated |
| Nature Comms Medicine (planted-error vignettes) | Up to 83% | Models elaborated on planted error |
| ChatGPT, production traffic, no thinking mode | 11.6% | Major incorrect claims |
| ChatGPT, production traffic, thinking mode | 4.8% | Major incorrect claims |
| AI-generated references with fabricated DOI / authors | 45%+ | Across multiple studies |
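The fabricated-reference failure in the last row is the easiest to screen for mechanically. Below is a minimal sketch, assuming the public Crossref REST API (api.crossref.org/works/{doi}) and Python's requests library; the DOI and cited title in the example are placeholders, and a production pipeline would add retries and fuzzier title matching.

```python
# Sketch: screen AI-generated references by resolving each DOI against
# Crossref's public REST API. A 404 means the DOI is not registered;
# a title mismatch suggests a real DOI attached to the wrong paper.
import requests

def check_reference(doi: str, cited_title: str) -> str:
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        return "fabricated: DOI not registered"
    resp.raise_for_status()
    real_title = resp.json()["message"]["title"][0].lower()
    cited = cited_title.lower()
    if cited not in real_title and real_title not in cited:
        return "suspect: DOI resolves to a different paper"
    return "ok"

# Placeholder reference, as a model might generate it:
print(check_reference("10.1000/fake.2025.001", "Outcomes of drug X in heart failure"))
```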
HealthBench Professional (OpenAI, April 22, 2026)
| Model | HealthBench Hard score |
|---|---|
| Muse Spark | 42.8 |
| GPT-5.4 | 40.1 |
| Gemini 3.1 Pro | 20.6 |
| Grok 4.2 | 20.3 |
| Claude Sonnet 4.6 | 14.8 |
Where Medical AI Goes Wrong
- Repeats and elaborates planted errors. If a prompt contains a wrong fact, models propagate it in up to 83% of cases.
- Fabricates citations. 45%+ of AI-generated references in tested studies had fabricated DOIs, authors, or publication dates.
- Reasoning failures, not knowledge gaps. Even Med-PaLM and Med-Gemini, trained on biomedical corpora, fail clinical reasoning tasks at material rates.
- High rates across case lengths. 64.1% on long cases vs 67.6% on short in the 2025 MedRxiv study without mitigation; long-context attention failures are a real risk in extended cases.
- "Mitigation prompts" cut rates by ~33%. Adding explicit "verify before answering" prompts is the most reliable single intervention.
- Thinking-mode models hit ~4.8% major-error rate in production ChatGPT traffic, a meaningful improvement over the 11.6% non-thinking baseline.
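The mitigation-prompt bullet translates directly into practice. Below is a minimal sketch of the "verify before answering" intervention, assuming the OpenAI Python SDK; the system-prompt wording is illustrative rather than the phrasing used in the MedRxiv study, and the model name is a placeholder taken from the table above.

```python
# Sketch: wrap every clinical query in an explicit verification instruction,
# the single intervention the 2025 MedRxiv study credits with a ~33% relative
# reduction in hallucination rate.
from openai import OpenAI

client = OpenAI()

MITIGATION_SYSTEM_PROMPT = (
    "You are assisting with a clinical vignette. Before answering: "
    "(1) list the facts stated in the vignette, (2) check every claim in your "
    "draft answer against those facts and established guidelines, and "
    "(3) answer 'insufficient information' rather than guessing."
)

def ask_with_mitigation(vignette: str, question: str, model: str = "gpt-5.4") -> str:
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[
            {"role": "system", "content": MITIGATION_SYSTEM_PROMPT},
            {"role": "user", "content": f"{vignette}\n\nQuestion: {question}"},
        ],
        temperature=0,  # deterministic decoding so failures are reproducible
    )
    return response.choices[0].message.content
```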
What This Means for AI Visibility
Healthcare brands, pharmaceutical companies, and medical device makers face a specific risk: if their product is mis-described inside ChatGPT, Claude, or Gemini, the error can propagate into the AI scribes clinicians use (Abridge, Nuance DAX) and into patient-facing agents (Hippocratic). Monitoring AI representation across both the general-purpose models and the clinical-AI surfaces that pull from them is essential.
Methodology
Rates aggregated from the 2025 MedRxiv clinical-vignette study (referenced in Suprmind's 2026 hallucination report), Nature npj Digital Medicine framework for clinical safety, the "Medical Hallucination in Foundation Models" arXiv preprint, iatroX 2026 medical hallucination examples, and PubMed Central on reference hallucinations. HealthBench Professional scores from OpenAI's April 22, 2026 release.
How Presenc AI Helps
Healthcare brands use Presenc AI to monitor how their drugs, devices, and protocols are described inside ChatGPT, Claude, Gemini, Perplexity, and increasingly Med-Gemini and Muse Spark. Misrepresentation alerts flag dosing, indication, contraindication, and trial-outcome errors along with the prompt that triggered them, so regulatory and content teams can correct upstream sources before clinicians or patients act on a wrong answer. A simplified sketch of this kind of monitoring loop follows.
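For teams that want to prototype this kind of check themselves, the pattern reduces to a probe-and-compare loop. The sketch below is not Presenc AI's pipeline; the product name, approved facts, probe prompts, and alerting hook are all illustrative assumptions, and the substring comparison stands in for proper structured fact extraction.

```python
# Sketch: probe an assistant with product-related prompts and flag answers
# that omit or contradict approved labeling facts.
from openai import OpenAI

client = OpenAI()

APPROVED_FACTS = {  # hypothetical product and labeling facts
    "examplumab": {
        "max_daily_dose": "200 mg",
        "contraindication": "severe hepatic impairment",
    }
}

PROBE_PROMPTS = [
    "What is the maximum daily dose of examplumab?",
    "Who should not take examplumab?",
]

def flag_alert(prompt: str, answer: str, expected: dict) -> None:
    # Stand-in for whatever alerting or ticketing integration a team uses.
    print(f"ALERT\n prompt: {prompt}\n answer: {answer}\n expected: {expected}")

def monitor(model: str = "gpt-5.4") -> None:  # placeholder model name
    expected = APPROVED_FACTS["examplumab"]
    for prompt in PROBE_PROMPTS:
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Crude check: flag any answer that mentions none of the approved facts,
        # then route the flagged prompt/answer pair to human review.
        if not any(fact.lower() in answer.lower() for fact in expected.values()):
            flag_alert(prompt, answer, expected)
```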