Performance of Large Language Models on a Neurology Board–Style Examination

Marc Cicero Schubert, Wolfgang Wick, Varun Venkataramani

doi:10.1001/jamanetworkopen.2023.46721


Abstract

Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs have shown heterogeneous results across specialized medical board examinations, their performance on neurology board examinations remains unexplored. This study aimed to assess the performance of LLMs on neurology board-style examinations.
This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation used a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. The performance of ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. The main outcome was the overall percentage score of each of the 2 LLMs.
LLM 2 significantly outperformed LLM 1, correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
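
The headline percentages follow directly from the counts reported in the abstract. As a minimal sketch (not the authors' analysis code), the Python snippet below recomputes each model's percentage score from the reported correct-answer counts and compares it with the reported mean human score. The variable and function names are illustrative assumptions; only the numbers come from the abstract.

```python
# Counts and the human mean as reported in the abstract.
# All names here are illustrative, not taken from the study.
TOTAL_QUESTIONS = 1956
CORRECT = {"LLM 1": 1306, "LLM 2": 1662}
HUMAN_MEAN_PCT = 73.8


def percent_correct(correct: int, total: int) -> float:
    """Return the percentage of correct answers, rounded to one decimal."""
    return round(100 * correct / total, 1)


# Percentage score per model, computed from the reported counts.
scores = {name: percent_correct(n, TOTAL_QUESTIONS) for name, n in CORRECT.items()}

for name, pct in scores.items():
    # Difference from the human mean, in percentage points.
    diff = pct - HUMAN_MEAN_PCT
    print(f"{name}: {pct:.1f}% ({diff:+.1f} points vs human mean)")
```

Rounded to one decimal, the recomputed scores match the reported 66.8% and 85.0%, which places LLM 1 about 7 percentage points below and LLM 2 about 11 percentage points above the mean human score.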


Journal: JAMA Network Open
Publication Date: Dec 7, 2023
Citations: 25
License type: CC BY


Similar Papers

How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
Galit Shmueli ... Bianca Maria Colosimo
INFORMS Journal on Data Science | VOL. 2
01 Apr 2023

Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
Ivan Civettini ... Paola Perfetti
Blood | VOL. 142
02 Nov 2023

Performance of Large Language Models on Medical Oncology Examination Questions
Jack B Longwell ... Rahul G Krishnan
JAMA Network Open | VOL. 7
18 Jun 2024

Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.
Yasin Celal Güneş ... Leman Günbey Karabekmez
Diagnostic and interventional radiology (Ankara, Turkey) | VOL. -
09 Sep 2024
