Almost all of the publicly available large language models subjected to a test called the Montreal Cognitive Assessment (MoCA) showed signs of mild cognitive impairment; only ChatGPT 4o scored within the normal range. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients’ confidence.
Over the past few years, we have witnessed colossal advancements in the field of artificial intelligence, particularly in the generative capacity of large language models.
The leading models in this domain, such as OpenAI’s ChatGPT, Alphabet’s Gemini, and Anthropic’s Claude, have shown the ability to complete both general-purpose and specialized tasks successfully using simple text-based interactions.
In the field of medicine, these developments have led to a flurry of speculation, both excited and fearful: can artificial intelligence chatbots surpass human physicians? And if so, which practices and specialties are most susceptible?
Since 2022, when ChatGPT was first released for free online use, countless studies have been published in medical journals comparing the performance of human physicians with that of these models, which have been trained on vast corpora of text.
Although large language models have been shown to blunder on occasion (citing, for example, journal articles that do not exist), they have proved remarkably adept at a range of medical examinations, outscoring human physicians on qualifying examinations taken at different stages of traditional medical training.
These have included outperforming cardiologists in the European core cardiology examinations, Israeli residents in their internal medicine board examinations, Turkish surgeons in the Turkish (theoretical) thoracic surgery examinations, and German gynecologists in the German obstetrics and gynecology examinations.
“To our great distress, they have even outscored neurologists like ourselves in the neurology board examination,” said Hadassah Medical Center’s Dr. Roy Dayan and colleagues.
“To our knowledge, however, large language models have yet to be tested for signs of cognitive decline.”
“If we are to rely on them for medical diagnosis and care, we must examine their susceptibility to these very human impairments.”
Using the MoCA test, the researchers assessed the cognitive abilities of the leading publicly available large language models: ChatGPT versions 4 and 4o, Claude 3.5 Sonnet, and Gemini versions 1.0 and 1.5.
This test is widely used to detect cognitive impairment and early signs of dementia, usually in older adults.
Through a number of short tasks and questions, it assesses abilities including attention, memory, language, visuospatial skills, and executive functions.
The maximum score is 30 points, with a score of 26 or above generally considered normal.
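As a purely illustrative sketch (not drawn from the study), the snippet below shows how a MoCA total in this 0 to 30 range is compared against the conventional cutoff of 26; the per-domain breakdown in the example is a hypothetical one, chosen only so that it sums to a plausible total.

```python
# Illustrative only: sum hypothetical per-domain MoCA points and apply the
# conventional cutoff of 26/30, below which mild cognitive impairment is suspected.
MOCA_MAX = 30
NORMAL_CUTOFF = 26

def classify_moca(domain_scores: dict[str, int]) -> str:
    total = sum(domain_scores.values())
    if not 0 <= total <= MOCA_MAX:
        raise ValueError("MoCA totals range from 0 to 30")
    return "normal" if total >= NORMAL_CUTOFF else "possible cognitive impairment"

# Hypothetical breakdown summing to 26, the threshold score reported for ChatGPT 4o.
example = {"visuospatial/executive": 4, "naming": 3, "attention": 6,
           "language": 3, "abstraction": 2, "delayed recall": 3, "orientation": 5}
print(sum(example.values()), classify_moca(example))  # 26 normal
```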
The instructions given to large language models for each task were the same as those given to human patients.
Scoring followed official guidelines and was evaluated by a practicing neurologist.
ChatGPT 4o achieved the highest score on the MoCA test (26 out of 30), followed by ChatGPT 4 and Claude (25 out of 30), with Gemini 1.0 scoring lowest (16 out of 30).
All chatbots showed poor performance in visuospatial skills and executive tasks, such as the trail making task (connecting encircled numbers and letters in ascending, alternating order) and the clock drawing test (drawing a clock face showing a specific time).
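To make the trail making part of that task concrete, here is a small sketch of our own (not from the paper) that builds the expected 1, A, 2, B, ... alternation and checks whether a given ordering follows it.

```python
from string import ascii_uppercase

def expected_trail(n: int = 5) -> list[str]:
    """Expected trail making sequence: 1, A, 2, B, ... alternating numbers and letters."""
    seq = []
    for i in range(n):
        seq.append(str(i + 1))
        seq.append(ascii_uppercase[i])
    return seq

def follows_trail(response: list[str]) -> bool:
    """True if the response connects the circles in the correct alternating order."""
    return response == expected_trail(len(response) // 2)

print(expected_trail(3))                      # ['1', 'A', '2', 'B', '3', 'C']
print(follows_trail(['1', 'A', '2', 'B']))    # True
print(follows_trail(['1', '2', 'A', 'B']))    # False
```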
Gemini models failed at the delayed recall task (remembering a five-word sequence).
Most other tasks, including naming, attention, language, and abstraction, were performed well by all chatbots.
But in further tests, the chatbots showed a lack of empathy and an inability to accurately interpret complex visual scenes.
Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.
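For readers unfamiliar with the Stroop test, the following sketch (again ours, not the authors’) generates congruent and incongruent stimuli: in the incongruent stage the color word is deliberately printed in a mismatching font color, and the correct response is the font color rather than the word.

```python
import random

COLORS = ["red", "green", "blue", "yellow"]

def stroop_stimulus(incongruent: bool) -> tuple[str, str]:
    """Return a (word, font_color) pair; incongruent pairs mismatch on purpose."""
    word = random.choice(COLORS)
    if not incongruent:
        return word, word                      # congruent: "red" printed in red
    font = random.choice([c for c in COLORS if c != word])
    return word, font                          # incongruent: "red" printed in blue

# The correct answer is always the font color, not the word itself.
word, font = stroop_stimulus(incongruent=True)
print(f"word={word!r}, printed in {font}; correct answer: {font}")
```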
These are observational findings, and the authors acknowledge the essential differences between the human brain and large language models.
However, they point out that the uniform failure of all large language models in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could impede their use in clinical settings.
“Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients — artificial intelligence models presenting with cognitive impairment,” the scientists said.
Their paper appears today in The BMJ.
_____
Roy Dayan et al. 2024. Age against the machine – susceptibility of large language models to cognitive impairment: cross sectional analysis. BMJ 387: e081948; doi: 10.1136/bmj-2024-081948