A recent study by investigators from Mass General Brigham has shed light on the potential benefits and limitations of using large language models (LLMs), a type of generative AI, in drafting replies to patient messages. Published in The Lancet Digital Health, the findings underscore the importance of cautious implementation and vigilant oversight to ensure patient safety.
Rising administrative workloads have contributed to physician burnout, prompting electronic health record (EHR) vendors to adopt generative AI algorithms that help clinicians draft messages to patients. Dr. Danielle Bitterman, corresponding author of the study, emphasized the potential of LLMs to reduce physician burden while improving patient education.
To assess the efficiency and safety of LLM-generated responses, the researchers used OpenAI’s GPT-4 to generate scenarios about patients with cancer, each with a corresponding patient question. Radiation oncologists first responded to the queries manually; GPT-4 then generated its own responses, which the same physicians reviewed and edited. In 31% of cases, physicians believed that an LLM-generated response had been written by a human.
Physician-drafted responses were, on average, shorter than LLM-generated responses, which included more educational content. While LLM assistance improved perceived efficiency, it also raised patient-safety concerns: left unedited, LLM-generated responses posed potential risks, including cases where they failed to instruct the patient to seek urgent medical care.
Despite these shortcomings, physicians found LLM-generated responses safe in 82.1% of cases and acceptable to send without further editing in 58.3% of cases. Notably, LLM-generated/physician-edited responses retained educational content, indicating its perceived value for patient education.
However, overreliance on LLMs may pose risks, given their demonstrated limitations. The study highlights the importance of human oversight and ongoing monitoring of LLM quality. Dr. Bitterman stressed the need for AI literacy among both patients and clinicians, along with a better understanding of how to address LLM errors.
Moving forward, the researchers aim to investigate how patients perceive LLM-based communications and how patients' demographic characteristics influence LLM-generated responses. The study underscores the need for balanced AI integration in medicine, one that prioritizes patient safety while leveraging the efficiency benefits of AI assistance.