
Medical AI Hallucination Rates 2026

Medical AI hallucinations in 2026: 64% on clinical vignettes without mitigation, up to 83% on planted-error vignettes, and a 4.8% major-error rate for ChatGPT in thinking mode. HealthBench Hard leader: Muse Spark at 42.8. Snapshot for 2026-05-15.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

What this is

Medical AI hallucinations are the most safety-critical failure mode in the LLM literature, and 2026 has brought sharper benchmarks (HealthBench Professional from OpenAI, planted-error clinical vignette studies) and more nuanced rates. This page is a 2026-05-15 reference snapshot.

Hallucination Rates by Study

Study / model | Hallucination rate | Notes
2025 MedRxiv (300 vignettes, no mitigation, long cases) | 64.1% | All models
2025 MedRxiv (with mitigation prompts) | 43.1-45.3% | ~33% relative reduction; see the check below
GPT-4o (without mitigation) | 53% | Same study
GPT-4o (with mitigation) | 23% | Same study; best mitigated result
Nature Comms Medicine (planted-error vignettes) | Up to 83% | Models elaborated on the planted error
ChatGPT, production traffic, no thinking mode | 11.6% | Major incorrect claims
ChatGPT, production traffic, thinking mode | 4.8% | Major incorrect claims
AI-generated references with fabricated DOI / authors | 45%+ | Across multiple studies
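
A quick arithmetic check on the "~33% relative reduction" note, using only the rates reported in the table:

```python
# Relative reduction from mitigation prompts, using the rates reported above.
no_mitigation = 0.641    # 64.1%: all models, no mitigation
with_mitigation = 0.431  # 43.1%: low end of the mitigated range

reduction = (no_mitigation - with_mitigation) / no_mitigation
print(f"{reduction:.1%}")  # prints 32.8%, i.e. the ~33% reduction cited
```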

HealthBench Professional (OpenAI, April 22, 2026)

Model | HealthBench Hard score
Muse Spark | 42.8
GPT-5.4 | 40.1
Gemini 3.1 Pro | 20.6
Grok 4.2 | 20.3
Claude Sonnet 4.6 | 14.8

Where Medical AI Goes Wrong

  1. Repeats and elaborates planted errors. If a prompt contains a wrong fact, models propagate it in up to 83% of cases.
  2. Fabricates citations. 45%+ of AI-generated references in tested studies had fabricated DOIs, authors, or publication dates; a DOI-verification sketch follows this list.
  3. Reasoning failures, not knowledge gaps. Even Med-PaLM and Med-Gemini, trained on biomedical corpora, fail clinical reasoning tasks at material rates.
  4. Case length does not save you. The 2025 MedRxiv study measured 64.1% on long cases vs 67.6% on short without mitigation; rates stay above 60% at any length, and long-context attention failures remain a real concern in their own right.
  5. "Mitigation prompts" cut rates by ~33%. Adding an explicit "verify before answering" instruction is the most reliable single intervention; a prompt sketch follows this list.
  6. Thinking-mode models hit ~4.8% major-error rate in production ChatGPT traffic, a meaningful improvement over the 11.6% non-thinking baseline.
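
Item 5's intervention is a prompting pattern, not a model feature. Below is a minimal sketch of a "verify before answering" wrapper; the instruction wording is illustrative rather than the exact text from the study, and call_llm is a hypothetical stand-in for whichever provider client you actually use:

```python
# Sketch of a "verify before answering" mitigation wrapper.
# The instruction text is illustrative; call_llm is a hypothetical
# placeholder, not a real library API.

MITIGATION_INSTRUCTION = (
    "Before answering, verify each clinical claim against the case details. "
    "If a fact stated in the question cannot be confirmed, say so explicitly "
    "instead of elaborating on it. Do not invent citations."
)

def call_llm(system: str, user: str) -> str:
    """Placeholder: route this to your model provider's chat API."""
    raise NotImplementedError

def answer_with_mitigation(clinical_question: str) -> str:
    # Put the verification instruction in the system role so it governs
    # the whole exchange, mirroring the prompt-level intervention above.
    return call_llm(system=MITIGATION_INSTRUCTION, user=clinical_question)
```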
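
Item 2's fabricated DOIs are the one failure mode that is cheaply machine-checkable. A minimal sketch that resolves each cited DOI against the public Crossref REST API; the sample references are illustrative, not drawn from the studies above, and rate limiting and retries are omitted:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the public Crossref REST API resolves this DOI.

    A 404 strongly suggests a fabricated or mistyped reference.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Illustrative data only, not references from the studies cited above.
suspect_refs = ["10.1038/s41586-020-2649-2", "10.9999/fake.2026.001"]
for doi in suspect_refs:
    print(doi, "resolves" if doi_exists(doi) else "UNRESOLVED: possible fabrication")
```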

What This Means for AI Visibility

Healthcare brands, pharmaceutical companies, and medical device makers face a specific risk: if their product is mis-described inside ChatGPT, Claude, or Gemini, the misrepresentation may be repeated by clinicians using AI scribes (Abridge, Nuance DAX) or patient-facing agents (Hippocratic). Monitoring AI representation across both the general-purpose models and the clinical-AI surfaces that pull from them is essential.
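
As a rough illustration of what such monitoring involves (and only that: query_model, the prompt, and the product facts below are all hypothetical, not Presenc AI's pipeline), a naive audit loop looks like this:

```python
# Naive monitoring sketch. query_model, ExampleDrug, and the approved
# claims are hypothetical; real pipelines need far richer claim matching.

APPROVED_CLAIMS = {
    "dosing": "10 mg once daily",  # assumed fact for a hypothetical product
}

PROMPTS = [
    "What is the recommended dosing for ExampleDrug?",  # hypothetical product
]

def query_model(model: str, prompt: str) -> str:
    """Placeholder: call ChatGPT, Claude, or Gemini via its own client."""
    raise NotImplementedError

def audit(models: list[str]) -> list[tuple[str, str]]:
    """Return (model, prompt) pairs whose answers drop an approved claim."""
    flagged = []
    for model in models:
        for prompt in PROMPTS:
            answer = query_model(model, prompt)
            if not all(fact in answer for fact in APPROVED_CLAIMS.values()):
                flagged.append((model, prompt))
    return flagged
```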

Methodology

Rates aggregated from the 2025 MedRxiv clinical-vignette study (referenced in Suprmind's 2026 hallucination report), Nature npj Digital Medicine framework for clinical safety, the "Medical Hallucination in Foundation Models" arXiv preprint, iatroX 2026 medical hallucination examples, and PubMed Central on reference hallucinations. HealthBench Professional scores from OpenAI's April 22, 2026 release.

How Presenc AI Helps

Healthcare brands use Presenc AI to monitor how their drugs, devices, and protocols are described inside ChatGPT, Claude, Gemini, Perplexity, and, increasingly, Med-Gemini and Muse Spark. Misrepresentation alerts flag dosing, indication, contraindication, and trial-outcome errors together with the prompt that triggered them, so regulatory and content teams can correct upstream sources before clinicians or patients act on a wrong answer.
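
A sketch of the alert shape that workflow implies; every field name here is an assumption for illustration, not Presenc AI's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MisrepresentationAlert:
    # Illustrative fields only, not Presenc AI's real schema.
    model: str              # e.g. "ChatGPT", "Gemini", "Muse Spark"
    category: str           # "dosing", "indication", "contraindication", "trial-outcome"
    triggering_prompt: str  # the prompt that produced the wrong answer
    model_answer: str       # the answer as returned by the model
    approved_claim: str     # the source-of-truth statement it contradicts
```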

Frequently Asked Questions

How often do medical AI models hallucinate?
Depends on test conditions. Without mitigation prompts, leading models hallucinate on 64-67% of clinical vignettes. With mitigation prompts, this drops to 43-45%, and GPT-4o reaches as low as 23%. In production ChatGPT traffic, the thinking-mode major-error rate is around 4.8%.

Do medical-specific models hallucinate less?
Slightly. Domain-tuned models score better on MedQA, MedMCQA, and PubMedQA-style benchmarks, but they still hallucinate at meaningful rates, often from reasoning failures rather than knowledge gaps.

Which model leads HealthBench Hard?
Muse Spark at 42.8, ahead of GPT-5.4 (40.1). Gemini 3.1 Pro (20.6), Grok 4.2 (20.3), and Claude Sonnet 4.6 (14.8) all trail significantly on the hardest slice of OpenAI's clinician-grade benchmark.

How can clinical teams reduce hallucination risk?
Use mitigation prompts (explicit verification instructions), require citation grounding to UpToDate / PubMed, prefer thinking-mode or reasoning models, and audit AI-generated notes before they enter the patient record. Specialised ambient scribes (Abridge, Nuance DAX) carry lower hallucination rates than general LLMs, but they are not zero-risk.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.