Comparative Analysis of Performance of Large Language Models in Urogynecology.

Ghanshyam S Yadav, Kshitij Pandit, Phillip T Connell, Hadi Erfani, Charles W Nager

doi:10.1097/spv.0000000000001545

Abstract

Despite the growing popularity of large language models in medicine, data on their performance in urogynecology are lacking. The aim of this study was to compare the performance of ChatGPT-3.5, GPT-4, and Bard on the American Urogynecologic Society self-assessment examination. The examination features 185 questions with a passing score of 80. We tested 3 models (ChatGPT-3.5, GPT-4, and Bard) on every question. Dedicated accounts enabled controlled comparisons. Questions with prompts were entered into each model's interface, and responses were evaluated for correctness, logical reasoning behind the answer choice, and sourcing. Data on subcategory, question type, correctness rate, question difficulty, and reference quality were noted. The Fisher exact or χ2 test was used for statistical analysis. Of the 185 questions, GPT-4 answered 61.6% correctly, compared with 54.6% for GPT-3.5 and 42.7% for Bard. GPT-4 answered all questions, whereas GPT-3.5 and Bard declined to answer 4 and 25 questions, respectively. All models demonstrated logical reasoning in their correct responses. The performance of all models declined as question difficulty increased. Bard referenced sources 97.5% of the time, more often than GPT-4 (83.3%) and GPT-3.5 (39%). GPT-3.5 cited books and websites, whereas GPT-4 and Bard additionally cited journal articles and society guidelines. The median journal impact factor and median number of citations were 3.6 and 20, respectively, for GPT-4 and 2.6 and 25 for Bard. Although GPT-4 outperformed GPT-3.5 and Bard, none of the models achieved a passing score. Clinicians should use language models cautiously in patient care scenarios until more evidence emerges.
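As an illustration of the arithmetic behind these figures, the following minimal Python sketch recovers approximate correct-answer counts from the reported percentages and runs an uncorrected Pearson χ2 test on one pairwise comparison. It rests on several assumptions not confirmed by the abstract: that all percentages use the full 185 questions as the denominator (so declined questions count as incorrect), that rounding to the nearest whole question recovers the true counts, and that a 2×2 test without continuity correction resembles the authors' analysis. The function and variable names are hypothetical.

```python
import math

TOTAL_QUESTIONS = 185  # from the abstract

# Reported percent correct per model (from the abstract).
PERCENT_CORRECT = {"GPT-4": 61.6, "GPT-3.5": 54.6, "Bard": 42.7}

# Assumption: percentages are out of all 185 questions, so rounding
# gives an approximate count of correct answers for each model.
counts = {
    model: round(pct / 100 * TOTAL_QUESTIONS)
    for model, pct in PERCENT_CORRECT.items()
}


def chi2_2x2(correct_a, correct_b, n=TOTAL_QUESTIONS):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Rows are the two models, columns are correct / not correct.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    a, b = correct_a, n - correct_a
    c, d = correct_b, n - correct_b
    total = a + b + c + d
    stat = total * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi2_2x2(counts["GPT-4"], counts["Bard"])
print(counts)
print(f"GPT-4 vs Bard: chi2 = {stat:.3f}, p = {p_value:.5f}")
```

Because the counts are reconstructed from rounded percentages, any statistic produced this way is only an approximation and should not be read as a result reported by the study.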

Full Text