Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank.

Rohaid Ali, Ian D Connolly, Deus Cielo, Wael F Asaad, Albert E Telfeian, Adetokunbo A Oyelese, Jared S Fridley, Ziya L Gokaslan, John H Shin, Patricia L Zadnik Sullivan, Curtis E Doberstein, Oliver Y Tang

doi:10.1227/neu.0000000000002551

Abstract

General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, the comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions are poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.

The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were entered in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.

On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 (P = .023). Among 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Lack of a question text description predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001).

On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT and Google Bard.
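
The 95% confidence intervals reported above can be sanity-checked with a short calculation. The abstract does not say which interval method was used, so the following is only a sketch that assumes a Wilson score interval with continuity correction; the function name `wilson_cc_interval` is hypothetical and chosen for illustration. It uses the one raw count the abstract gives (Bard, 66 of 149 correct).

```python
import math


def wilson_cc_interval(successes, n, z=1.959964):
    """Wilson score interval with continuity correction for a proportion.

    Assumption: the abstract does not name its interval method; this is
    one common choice, not necessarily the authors' method.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    z2 = z * z
    denom = 2 * (n + z2)
    # Terms under the square roots of the continuity-corrected bounds.
    lo_term = z2 - 2 - 1 / n + 4 * p * (n * (1 - p) + 1)
    hi_term = z2 + 2 - 1 / n + 4 * p * (n * (1 - p) - 1)
    lower = (2 * n * p + z2 - 1 - z * math.sqrt(max(0.0, lo_term))) / denom
    upper = (2 * n * p + z2 + 1 + z * math.sqrt(max(0.0, hi_term))) / denom
    # Clamp to [0, 1] and pin the bounds at the extremes.
    lower = 0.0 if successes == 0 else max(0.0, lower)
    upper = 1.0 if successes == n else min(1.0, upper)
    return lower, upper


if __name__ == "__main__":
    # Bard: 66 of 149 questions correct (count taken from the abstract).
    lo, hi = wilson_cc_interval(66, 149)
    print(f"Bard accuracy 95% CI: {lo:.1%} to {hi:.1%}")
```

Under this assumption, the Bard count yields bounds of roughly 36.2% and 52.6%, in line with the interval reported in the abstract. Counts for GPT-3.5 and GPT-4 are not stated in the abstract and would need to be taken from the full paper before repeating the check.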


Journal: Neurosurgery	Publication Date: Jun 12, 2023
Citations: 121

Similar Papers

Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations.
Rohaid Ali ... Albert E Telfeian
Neurosurgery | VOL. 93
15 Aug 2023

How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
Galit Shmueli ... Bianca Maria Colosimo
INFORMS Journal on Data Science | VOL. 2
01 Apr 2023

Performance of Large Language Models on a Neurology Board–Style Examination
Marc Cicero Schubert ... Varun Venkataramani
JAMA network open | VOL. 6
07 Dec 2023

Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study.
Michael S Deiner ... Urmimala Sarkar
JMIR infodemiology | VOL. 4
29 Aug 2024
