Artificial Intelligence (AI) models such as ChatGPT have become prominent in healthcare, offering promising ways to ease clinician workload by triaging patients, taking medical histories, and even providing preliminary diagnoses. These tools, known as large language models, are already being used by patients to better understand their symptoms and medical test results. But how well do these AI models perform in real-world situations, where human interaction and dynamic reasoning are essential?

According to a recent study led by researchers at Harvard Medical School and Stanford University, the answer is not very well—at least not yet.

Published on January 2 in Nature Medicine, the study introduces a new evaluation framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine). The test was designed to assess how AI tools perform in realistic, patient-like conversations. The research team used CRAFT-MD to evaluate four large language models, both proprietary and open-source, to see how they would handle clinical interactions that closely mimic those in an actual medical setting.

The results were eye-opening. While the AI models performed admirably on medical exam-style questions, their ability to engage in natural, back-and-forth conversations—typical of real-world doctor-patient interactions—was significantly lower. The study emphasizes the need for a shift in how these models are tested and developed, pointing to a two-fold problem: AI tools must be assessed in more realistic environments, and their diagnostic abilities need to improve when interacting with patients in dynamic, unpredictable settings.

“Our work reveals a striking paradox,” said Pranav Rajpurkar, senior author of the study and assistant professor of biomedical informatics at Harvard Medical School. “While these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit. The dynamic nature of medical conversations—asking the right questions at the right time, piecing together scattered information, and reasoning through symptoms—poses unique challenges that go far beyond answering multiple-choice questions.”

In the study, the researchers used CRAFT-MD to test the AI models across 2,000 clinical vignettes covering conditions common in primary care and 12 medical specialties. The models often struggled to ask the right questions to gather relevant patient history, missed critical information, and had difficulty synthesizing scattered data. Their performance notably declined when presented with open-ended patient responses or when they were required to conduct conversations rather than simply answer isolated questions.

Shreya Johri, a doctoral student at Harvard Medical School and co-first author of the study, emphasized that current testing methods, which primarily involve answering multiple-choice questions based on medical exams, are inadequate for reflecting the messy, complex nature of real-world doctor-patient interactions. “We need a testing framework that mirrors reality more accurately,” Johri explained. “Such a framework would provide a better indication of how AI models would perform in practice.”

CRAFT-MD, developed as part of this research, is designed to do just that. The framework evaluates how well large language models can collect patient information, including symptoms, medications, and family history, and then use that information to make a diagnosis. The tool uses one AI agent to simulate the patient and another to grade the AI's diagnosis. In addition, human experts assess each encounter based on the AI's ability to gather relevant information, synthesize data, and adhere to prompts.
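
To make that setup concrete, here is a minimal, illustrative sketch of how such a paired-agent evaluation loop could be wired together. It is an assumption-laden illustration only: the function names (query_doctor_model, simulate_patient_reply, grade_diagnosis) are hypothetical stubs standing in for language-model API calls, not the authors' implementation.

    # A minimal, illustrative sketch of a paired-agent evaluation loop in the
    # spirit of CRAFT-MD. The three agent functions are hypothetical stubs
    # standing in for language-model API calls; they are not the authors' code.
    from dataclasses import dataclass, field

    def query_doctor_model(transcript: list) -> str:
        # Model under test: asks a follow-up question or commits to a diagnosis.
        if len(transcript) >= 4:  # toy stopping rule for this stub
            return "DIAGNOSIS: acute pharyngitis"
        return "How long have you had these symptoms?"

    def simulate_patient_reply(vignette: str, transcript: list) -> str:
        # Patient agent: answers only from the facts in the clinical vignette.
        return "About three days, with a sore throat and a mild fever."

    def grade_diagnosis(predicted: str, truth: str) -> bool:
        # Grader agent: compares the final diagnosis with the ground truth.
        return predicted.strip().lower() == truth.strip().lower()

    @dataclass
    class Encounter:
        vignette: str                  # ground-truth case given to the patient agent
        transcript: list = field(default_factory=list)
        diagnosis: str = ""
        correct: bool = False

    def run_encounter(vignette: str, true_diagnosis: str, max_turns: int = 10) -> Encounter:
        enc = Encounter(vignette=vignette)
        for _ in range(max_turns):
            doctor_msg = query_doctor_model(enc.transcript)
            enc.transcript.append(("doctor", doctor_msg))
            if doctor_msg.startswith("DIAGNOSIS:"):
                enc.diagnosis = doctor_msg.split(":", 1)[1].strip()
                break
            patient_msg = simulate_patient_reply(vignette, enc.transcript)
            enc.transcript.append(("patient", patient_msg))
        enc.correct = grade_diagnosis(enc.diagnosis, true_diagnosis)
        return enc

    if __name__ == "__main__":
        result = run_encounter("3-day history of sore throat and fever ...", "acute pharyngitis")
        print(result.diagnosis, result.correct)

In this pattern, the model under test never sees the vignette directly; it must elicit the relevant facts through questioning, which is precisely the conversational skill the study found lacking.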

Based on their findings, the research team offers several recommendations for improving AI’s real-world performance, including:

  • Using conversational, open-ended questions in training and testing AI tools to better reflect unstructured doctor-patient interactions.
  • Evaluating models for their ability to extract essential information through appropriate questioning.
  • Designing models that can handle multiple conversations and integrate diverse information.
  • Enhancing AI to interpret non-verbal cues such as facial expressions, tone of voice, and body language.
  • Incorporating both AI agents and human experts in evaluation processes to reduce labor costs and avoid exposing real patients to unverified AI models.

The researchers argue that incorporating such recommendations into AI development could help optimize these tools for clinical use. For instance, CRAFT-MD demonstrated its ability to process 10,000 conversations in just 48 to 72 hours, far outperforming human evaluators who would need hundreds of hours for similar evaluations.

“The goal is not to replace doctors but to augment clinical practice effectively and ethically,” said Roxana Daneshjou, co-senior author of the study and assistant professor of biomedical data science and dermatology at Stanford University. “CRAFT-MD is a step forward in testing AI models for health care, bringing us closer to tools that are truly ready for clinical use.”

As AI continues to advance, the researchers anticipate ongoing improvements to CRAFT-MD, ensuring that it stays aligned with the evolving capabilities of AI models and remains relevant in assessing their readiness for real-world applications.

For more information, refer to the original study published in Nature Medicine on January 2, 2025.


Source: Nature Medicine, DOI: 10.1038/s41591-024-03328-5
