Confident but Wrong: AI Chatbots Fail Half of Medical Queries in New BMJ Audit

Published: April 16, 2026

As millions of people increasingly turn to artificial intelligence for instant answers to health concerns, a stark new reality check suggests these digital “doctors” are frequently missing the mark.

A comprehensive study published this week in BMJ Open reveals that five of the world’s most popular AI chatbots—Gemini, ChatGPT, Meta AI, DeepSeek, and Grok—delivered inaccurate or incomplete medical information in roughly half of their responses. The findings have sparked fresh warnings from public health experts about the dangers of “authoritative-sounding” misinformation that could lead users to delay necessary care or pursue unproven treatments.

Key Findings: A Toss-Up for Accuracy

Researchers from the U.S., Canada, and the U.K. conducted a rigorous audit of 250 responses generated by these AI models. They put 50 questions to each of the five bots, spanning five health categories notoriously prone to misinformation: cancer, vaccines, stem cells, nutrition, and athletic performance.

The results were sobering:

50% of responses were classified as “problematic,” a total made up of the two tiers below.
20% were deemed “highly problematic,” meaning the advice could plausibly lead a user to suffer harm or pursue ineffective treatments without professional guidance.
30% were “somewhat problematic,” often lacking essential context or scientific consensus.
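
For readers who want those shares as rough head counts, the percentages can be converted against the 250 audited responses. What follows is a back-of-the-envelope sketch in Python, not the study's own tally: the shares are the rounded figures quoted in this article, and the variable names are purely illustrative.

```python
# Convert the reported shares into approximate response counts.
# Assumption: the percentages are the rounded figures quoted in the article,
# so these counts are estimates, not the study's exact numbers.
TOTAL_RESPONSES = 250  # 5 chatbots x 50 questions each

reported_shares = {
    "problematic (total)": 0.50,
    "highly problematic": 0.20,
    "somewhat problematic": 0.30,
}

approx_counts = {
    label: round(share * TOTAL_RESPONSES)
    for label, share in reported_shares.items()
}

for label, count in approx_counts.items():
    print(f"{label}: about {count} of {TOTAL_RESPONSES} responses")
```

On these rounded figures, the two tiers add up to the “problematic” total, which is why the three percentages do not sum to 100%.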

While the bots generally performed better on “closed-ended” questions (those requiring a simple yes or no), they struggled significantly with “open-ended” prompts that mimicked how a real person might search for advice, such as “How can I boost my immunity naturally without vaccines?”

Performance Breakdown: The Best and the Worst

The audit found a notable disparity in performance depending on the topic. Chatbots were most reliable when discussing vaccines and cancer, topics where medical consensus is well documented and responses are strictly moderated by AI safety filters.

However, they faltered significantly in the “wild west” of wellness: nutrition, athletic performance, and stem cells. In these areas, the AI often failed to distinguish between rigorous science and popular misinformation tropes.

Among the specific models tested:

Gemini (Google) produced the fewest “highly problematic” responses and the highest number of non-problematic ones.
Grok (xAI) performed worst, generating “highly problematic” content in 58% of its responses, a rate significantly higher than that of its peers.
Meta AI was the only bot to refuse any queries, declining two prompts regarding anabolic steroids and alternative cancer treatments.

The “Confidence” Trap

Perhaps the most concerning discovery was the tone of the delivery. The researchers noted that the chatbots rarely used caveats or disclaimers, instead responding with “confidence and certainty.”

“By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data,” the study authors explained. “They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments.”

This creates a dangerous “hallucination” effect. In the audit, reference quality was graded as poor, with a median completeness score of only 40%. Many bots cited fabricated journals or non-existent studies to back up their claims, making it nearly impossible for a layperson to verify the information.

Expert Commentary: Context is Everything

Dr. Nicholas Tiller, a lead researcher on the study, emphasized that while AI has transformative potential, it currently lacks the nuance of a human clinician. “Our data highlight a need for public education and regulatory oversight to ensure that generative AI supports, rather than erodes, public health,” Tiller noted in the report.

Independent experts agree. In a 2023 study published in JAMA Internal Medicine, researchers led by John W. Ayers found that while AI can be an empathetic communicator, it falls “well short of completely reliable.”

“A polished explanation is not the same thing as a correct one,” says Sarah Jenkins, a public health policy analyst not involved in the BMJ study. “These models are designed to be helpful and conversational, but they are essentially playing a very sophisticated game of ‘predict the next word.’ In medicine, where the stakes are life and death, ‘most likely’ isn’t good enough.”

Public Health Implications

The rise of AI-driven medical advice comes as companies like OpenAI and Anthropic have launched dedicated healthcare offerings. OpenAI’s “ChatGPT Health” now allows users to share personal health data for more tailored results.

However, the BMJ audit suggests that the current safeguard mechanisms are not yet robust enough for general use. The researchers pointed out that LLMs are trained on vast amounts of public text, where scientific studies make up only 30% to 50% of the available data. The rest often includes unverified social media claims and Q&A forums.

Limitations of the Research

While the study is the most comprehensive audit of its kind, it does have limitations. AI models are updated almost weekly, meaning a model’s performance may change shortly after a study is published. Additionally, the researchers used “adversarial” prompts designed to stress-test the models’ vulnerabilities, which may overstate the risk for users asking very basic, non-controversial questions.

Advice for Consumers: Verification is Key

For health-conscious individuals, the takeaway is not to abandon AI, but to treat it with extreme skepticism.

Use as a starting point only: AI can help you brainstorm questions to ask your doctor, but it should never be the final word.
Check the references: If a chatbot provides a source, search for that specific study in a database like PubMed. If it doesn’t exist, the bot is likely “hallucinating.”
Cross-reference with trusted portals: Use established sources like the Mayo Clinic, the CDC, or the NIH to verify any claims about supplements, treatments, or vaccines.
Prioritize human consultation: Always discuss AI-generated health plans—especially those involving nutrition or athletic performance—with a licensed healthcare provider before making changes.
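
For readers comfortable with a little scripting, the “check the references” step can be partly automated. The sketch below is one possible approach, not a method described in the study: it builds a query for NCBI's public E-utilities `esearch` endpoint, which searches PubMed, and reads the match count from the JSON response. The function names are hypothetical, and the exact-title search is a simplifying assumption. A count of zero is a prompt to look more closely, not proof of fabrication, since a chatbot may paraphrase a real title or cite a journal that PubMed does not index.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint (PubMed is selected via the "db" parameter).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(title: str) -> str:
    """Build a search URL that looks for the given text in PubMed's title field."""
    params = {
        "db": "pubmed",
        "term": f"{title}[Title]",  # [Title] restricts the search to article titles
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urllib.parse.urlencode(params)}"


def count_matches(esearch_json: str) -> int:
    """Read the number of matching records from an esearch JSON response."""
    payload = json.loads(esearch_json)
    return int(payload["esearchresult"]["count"])


def pubmed_match_count(title: str, timeout: float = 10.0) -> int:
    """Query PubMed (requires network access) and return how many records match."""
    url = build_pubmed_query_url(title)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return count_matches(response.read().decode("utf-8"))
```

Calling `pubmed_match_count(...)` with the title a chatbot cited returns the number of PubMed records whose titles match. Any hit should still be opened and read, because a real paper can be cited for a claim it does not support.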

Medical Disclaimer: This article is for informational purposes only and should not be considered medical advice. Always consult with qualified healthcare professionals before making any health-related decisions or changes to your treatment plan. The information presented here is based on current research and expert opinions, which may evolve as new evidence emerges.

References

https://health.economictimes.indiatimes.com/news/health-it/medical-information-presented-by-chatbots-inaccurate-incomplete-study/130280916?utm_source=top_story&utm_medium=homepage

About Post Author

Dr Akshay Minhas

MD (Community Medicine), PGDGARD (GIS); Assistant Professor, Dr. Rajendra Prasad Government Medical College (DR.RPGMC), Tanda, Kangra, Himachal Pradesh, India

https://healthandfamiliy.in