December 19, 2024 – In a study published in the Christmas issue of the BMJ, researchers found that nearly all leading large language models (LLMs), also known as AI chatbots, exhibit signs of mild cognitive impairment on tests commonly used to detect early dementia. The findings challenge the assumption that AI will soon replace human doctors.
With rapid advances in artificial intelligence, much speculation has swirled around whether chatbots could eventually surpass human physicians in medical diagnostics. While previous studies have shown that LLMs perform well on a range of medical tasks, their susceptibility to human-like cognitive impairments had not been examined until now.
To investigate this, the researchers assessed the cognitive abilities of the leading publicly available LLMs: ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet). The team used the Montreal Cognitive Assessment (MoCA), a test commonly used to detect cognitive impairment and early signs of dementia in adults. The MoCA measures a range of cognitive functions, including attention, memory, language, visuospatial skills, and executive function, with a score of 26 or above (out of a possible 30) generally considered normal.
ChatGPT 4o achieved the highest score with 26 out of 30, followed closely by ChatGPT 4 and Claude, both scoring 25. In contrast, Gemini 1.0 scored the lowest with just 16 points.
Notably, all of the chatbots struggled with tasks involving visuospatial skills and executive functions—such as the trail-making task (which involves connecting numbers and letters in a specified order) and the clock-drawing task (which requires drawing a clock face showing a specific time). The Gemini models also failed to perform the delayed recall task, which asks participants to remember a sequence of five words.
Despite these weaknesses, the chatbots performed well on tasks assessing attention, language, and abstraction, demonstrating advanced capabilities in those cognitive domains. However, their difficulty with tasks requiring visual abstraction and executive function could hinder their practical use in clinical settings, where such skills are critical.
In the Stroop test, which measures how well a person can manage interference between conflicting stimuli, only ChatGPT 4o succeeded at the incongruent stage, in which color names are printed in a mismatched font color. This result points to another limitation in the ability of AI models to handle complex, conflicting stimuli.
The researchers emphasize that these findings are observational and that there are essential differences between the human brain and large language models. Nevertheless, the uniform weaknesses displayed by all the chatbots in key areas of cognition suggest that LLMs are not ready to replace human neurologists or other healthcare professionals anytime soon. In fact, the study raises the possibility that neurologists may soon be tasked with diagnosing cognitive impairment in artificial intelligence models themselves.
“The future of AI in medicine is uncertain,” the authors conclude. “Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients—AI models presenting with cognitive impairment.”
The research sheds light on a critical, previously unexplored aspect of AI technology, drawing attention to the limitations of LLMs on tasks that require higher-level cognitive functions and complex decision-making. While AI shows promise in many areas of healthcare, whether it can replace human physicians remains uncertain.
For more information, refer to the full study in BMJ (2024). DOI: 10.1136/bmj-2024-081948.