
Medical AI Hallucination Rates 2026

Medical AI hallucinations in 2026: 64% on clinical vignettes without mitigation, up to 83% on planted-error vignettes, and a 4.8% major-error rate for ChatGPT in thinking mode. HealthBench Hard leader: Muse Spark at 42.8. Snapshot for 2026-05-15.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

What this is

Medical AI hallucinations are the most safety-critical failure mode in the LLM literature, and 2026 has brought sharper benchmarks (HealthBench Professional from OpenAI, planted-error clinical vignette studies) and more nuanced rates. This page is a 2026-05-15 reference snapshot.

Hallucination Rates by Study

Study / model | Hallucination rate | Notes
2025 MedRxiv (300 vignettes, no mitigation, long cases) | 64.1% | All models
2025 MedRxiv (with mitigation prompts) | 43.1-45.3% | ~33% relative reduction; see the check below
GPT-4o (without mitigation) | 53% | Same study
GPT-4o (with mitigation) | 23% | Same study; best mitigated result
Nature Comms Medicine (planted-error vignettes) | Up to 83% | Models elaborated on the planted error
ChatGPT, production traffic, no thinking mode | 11.6% | Major incorrect claims
ChatGPT, production traffic, thinking mode | 4.8% | Major incorrect claims
AI-generated references with fabricated DOI / authors | 45%+ | Across multiple studies
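
A quick arithmetic check on the "~33% relative reduction" note, using only the rates reported in the table:

```python
# Relative reduction from mitigation prompts, using the rates reported above.
no_mitigation = 0.641    # 64.1%: all models, no mitigation
with_mitigation = 0.431  # 43.1%: low end of the mitigated range

reduction = (no_mitigation - with_mitigation) / no_mitigation
print(f"{reduction:.1%}")  # prints 32.8%, i.e. the ~33% reduction cited
```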

HealthBench Professional (OpenAI, April 22, 2026)

Model | HealthBench Hard score
Muse Spark | 42.8
GPT-5.4 | 40.1
Gemini 3.1 Pro | 20.6
Grok 4.2 | 20.3
Claude Sonnet 4.6 | 14.8

Where Medical AI Goes Wrong

  1. Repeats and elaborates planted errors. If a prompt contains a wrong fact, models propagate it in up to 83% of cases.
  2. Fabricates citations. 45%+ of AI-generated references in tested studies had fabricated DOIs, authors, or publication dates; a DOI-verification sketch follows this list.
  3. Reasoning failures, not knowledge gaps. Even Med-PaLM and Med-Gemini, trained on biomedical corpora, fail clinical reasoning tasks at material rates.
  4. Case length does not save you. The 2025 MedRxiv study measured 64.1% on long cases vs 67.6% on short without mitigation; rates stay above 60% at any length, and long-context attention failures remain a real concern in their own right.
  5. "Mitigation prompts" cut rates by ~33%. Adding an explicit "verify before answering" instruction is the most reliable single intervention; a prompt sketch follows this list.
  6. Thinking-mode models hit ~4.8% major-error rate in production ChatGPT traffic, a meaningful improvement over the 11.6% non-thinking baseline.
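
Item 5's intervention is a prompting pattern, not a model feature. Below is a minimal sketch of a "verify before answering" wrapper; the instruction wording is illustrative rather than the exact text from the study, and call_llm is a hypothetical stand-in for whichever provider client you actually use:

```python
# Sketch of a "verify before answering" mitigation wrapper.
# The instruction text is illustrative; call_llm is a hypothetical
# placeholder, not a real library API.

MITIGATION_INSTRUCTION = (
    "Before answering, verify each clinical claim against the case details. "
    "If a fact stated in the question cannot be confirmed, say so explicitly "
    "instead of elaborating on it. Do not invent citations."
)

def call_llm(system: str, user: str) -> str:
    """Placeholder: route this to your model provider's chat API."""
    raise NotImplementedError

def answer_with_mitigation(clinical_question: str) -> str:
    # Put the verification instruction in the system role so it governs
    # the whole exchange, mirroring the prompt-level intervention above.
    return call_llm(system=MITIGATION_INSTRUCTION, user=clinical_question)
```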
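
Item 2's fabricated DOIs are the one failure mode that is cheaply machine-checkable. A minimal sketch that resolves each cited DOI against the public Crossref REST API; the sample references are illustrative, not drawn from the studies above, and rate limiting and retries are omitted:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the public Crossref REST API resolves this DOI.

    A 404 strongly suggests a fabricated or mistyped reference.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Illustrative data only, not references from the studies cited above.
suspect_refs = ["10.1038/s41586-020-2649-2", "10.9999/fake.2026.001"]
for doi in suspect_refs:
    print(doi, "resolves" if doi_exists(doi) else "UNRESOLVED: possible fabrication")
```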

What This Means for AI Visibility

Healthcare brands, pharmaceutical companies, and medical device makers face a specific risk: if their product is mis-described inside ChatGPT, Claude, or Gemini, the misrepresentation may be repeated by clinicians using AI scribes (Abridge, Nuance DAX) or patient-facing agents (Hippocratic). Monitoring AI representation across both the general-purpose models and the clinical-AI surfaces that pull from them is essential.
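
As a rough illustration of what such monitoring involves (and only that: query_model, the prompt, and the product facts below are all hypothetical, not Presenc AI's pipeline), a naive audit loop looks like this:

```python
# Naive monitoring sketch. query_model, ExampleDrug, and the approved
# claims are hypothetical; real pipelines need far richer claim matching.

APPROVED_CLAIMS = {
    "dosing": "10 mg once daily",  # assumed fact for a hypothetical product
}

PROMPTS = [
    "What is the recommended dosing for ExampleDrug?",  # hypothetical product
]

def query_model(model: str, prompt: str) -> str:
    """Placeholder: call ChatGPT, Claude, or Gemini via its own client."""
    raise NotImplementedError

def audit(models: list[str]) -> list[tuple[str, str]]:
    """Return (model, prompt) pairs whose answers drop an approved claim."""
    flagged = []
    for model in models:
        for prompt in PROMPTS:
            answer = query_model(model, prompt)
            if not all(fact in answer for fact in APPROVED_CLAIMS.values()):
                flagged.append((model, prompt))
    return flagged
```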

Methodology

Rates aggregated from the 2025 MedRxiv clinical-vignette study (referenced in Suprmind's 2026 hallucination report), Nature npj Digital Medicine framework for clinical safety, the "Medical Hallucination in Foundation Models" arXiv preprint, iatroX 2026 medical hallucination examples, and PubMed Central on reference hallucinations. HealthBench Professional scores from OpenAI's April 22, 2026 release.

How Presenc AI Helps

Healthcare brands use Presenc AI to monitor how their drugs, devices, and protocols are described inside ChatGPT, Claude, Gemini, Perplexity, and, increasingly, Med-Gemini and Muse Spark. Misrepresentation alerts flag dosing, indication, contraindication, and trial-outcome errors together with the prompt that triggered them, so regulatory and content teams can correct upstream sources before clinicians or patients act on a wrong answer.
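
A sketch of the alert shape that workflow implies; every field name here is an assumption for illustration, not Presenc AI's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MisrepresentationAlert:
    # Illustrative fields only, not Presenc AI's real schema.
    model: str              # e.g. "ChatGPT", "Gemini", "Muse Spark"
    category: str           # "dosing", "indication", "contraindication", "trial-outcome"
    triggering_prompt: str  # the prompt that produced the wrong answer
    model_answer: str       # the answer as returned by the model
    approved_claim: str     # the source-of-truth statement it contradicts
```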

Frequently Asked Questions

How often do medical AI models hallucinate?
Depends on test conditions. Without mitigation prompts, leading models hallucinate on 64-67% of clinical vignettes. With mitigation prompts, this drops to 43-45%, and GPT-4o reaches as low as 23%. In production ChatGPT traffic, the thinking-mode major-error rate is around 4.8%.

Do medical-specific models hallucinate less?
Slightly. Domain-tuned models score better on MedQA, MedMCQA, and PubMedQA-style benchmarks, but they still hallucinate at meaningful rates, often from reasoning failures rather than knowledge gaps.

Which model leads HealthBench Hard?
Muse Spark at 42.8, ahead of GPT-5.4 (40.1). Gemini 3.1 Pro (20.6), Grok 4.2 (20.3), and Claude Sonnet 4.6 (14.8) all trail significantly on the hardest slice of OpenAI's clinician-grade benchmark.

How can clinical teams reduce hallucination risk?
Use mitigation prompts (explicit verification instructions), require citation grounding to UpToDate / PubMed, prefer thinking-mode or reasoning models, and audit AI-generated notes before they enter the patient record. Specialised ambient scribes (Abridge, Nuance DAX) carry lower hallucination rates than general LLMs, but they are not zero-risk.

Track Your AI Visibility

See how your brand appears across ChatGPT, Claude, Perplexity, and other AI platforms. Start monitoring today.