New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology.

Linda My Huynh, Benjamin T Bonebrake, Christopher M Deibert, Alan Quach, Kaitlyn Schultis

doi:10.1097/upj.0000000000000406

Abstract

Large language models have demonstrated impressive capabilities, but their application to medicine remains unclear. We sought to evaluate the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians. One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining 135 items were each encoded as both open-ended and multiple-choice questions. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning. ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated for 40 (29.6%) open-ended and 4 (3.0%) multiple-choice questions. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) were on initial output, 8 (22.2%) and 1 (2.6%) on second output, and 4 (11.1%) and 1 (2.6%) on final output, respectively. Although regeneration decreased indeterminate responses, the proportion of correct responses did not increase. For both open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers. ChatGPT previously demonstrated promise on medical licensing exams; however, that promise did not carry over to the 2022 Self-assessment Study Program. Performance was better on multiple-choice than on open-ended questions. More important were the persistent justifications for incorrect responses: left unchecked, the use of ChatGPT in medicine may facilitate medical misinformation.
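The percentages in the abstract can be rechecked from the reported counts. The sketch below is only an illustration of that arithmetic: the counts are copied from the abstract, while the dictionary layout and the names `REPORTED`, `pct`, and `summarize` are assumptions introduced here, not part of the study.

```python
# Counts copied from the abstract; the dictionary layout and function
# names are illustrative assumptions, not part of the study.
TOTAL_QUESTIONS = 150 - 15  # 150 screened, 15 with visual assets removed

REPORTED = {
    "open_ended": {
        "correct": 36,
        "indeterminate": 40,
        "correct_by_output": (24, 8, 4),  # initial, second, final output
    },
    "multiple_choice": {
        "correct": 38,
        "indeterminate": 4,
        "correct_by_output": (36, 1, 1),
    },
}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize(counts: dict) -> dict:
    """Recompute the percentages reported for one question format."""
    correct = counts["correct"]
    # The per-output counts should add up to the total number correct.
    assert sum(counts["correct_by_output"]) == correct
    return {
        "correct_pct": pct(correct, TOTAL_QUESTIONS),
        "indeterminate_pct": pct(counts["indeterminate"], TOTAL_QUESTIONS),
        "by_output_pct": [pct(n, correct) for n in counts["correct_by_output"]],
    }


if __name__ == "__main__":
    for question_format, counts in REPORTED.items():
        print(question_format, summarize(counts))
```

Run this way, every figure matches the abstract except one: 38/135 rounds to 28.1%, slightly below the 28.2% printed for multiple-choice questions. The difference does not change the reading of the results.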


Journal: Urology Practice
Publication Date: Jun 5, 2023
Citations: 35

