Evaluating the Efficacy of AI Chatbots as Tutors in Urology: A Comparative Analysis of Responses to the 2022 In-Service Assessment of the European Board of Urology

Katharina Körner-Riffard, Klaus Eredics, Lisa Kollitsch, Martin Marszalek, Matthias May, Maximilian Burger, Michael Rauchenwald, Sabine D Brookman-May

doi:10.1159/000537854

Abstract

Introduction: This study assessed the potential of large language models (LLMs) as educational tools by evaluating their accuracy in answering questions across urological subtopics. Methods: Three LLMs (ChatGPT-3.5, ChatGPT-4, and Bing AI) were examined in two testing rounds, separated by 48 h, using 100 multiple-choice questions (MCQs) from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA), covering five different subtopics. “Formal accuracy” (FA) was defined as selecting the designated single best answer (SBA) among four options. Alternative answers selected by the LLMs that were not the SBA but were still deemed correct were labeled “extended accuracy” (EA). Their capacity to raise the overall accuracy rate when combined with FA was examined. Results: In the two testing rounds, the FA scores were as follows: ChatGPT-3.5, 58% and 62%; ChatGPT-4, 63% and 77%; and Bing AI, 81% and 73%. Incorporating EA did not significantly improve overall performance: the gains for ChatGPT-3.5, ChatGPT-4, and Bing AI were 7% and 5%, 5% and 2%, and 3% and 1%, respectively (p > 0.3). Within urological subtopics, the LLMs performed best in Pediatrics/Congenital and comparatively worse in Functional/BPS/Incontinence. Conclusion: LLMs exhibit suboptimal urology knowledge and unsatisfactory proficiency for educational purposes. Overall accuracy did not improve significantly when EA was combined with FA. Error rates remained high, ranging from 16% to 35%. Proficiency levels vary substantially across subtopics. Further development of medicine-specific LLMs is required before integration into urological training programs.
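The error-rate range quoted in the conclusion can be checked against the reported scores. The following is a minimal sketch, assuming that overall accuracy is simply the FA score plus the EA gain in percentage points and that the error rate is its complement; the abstract does not spell out this formula, and the names `fa_scores`, `ea_gains`, `combined_accuracy`, and `error_rates` are illustrative only.

```python
# FA scores and EA gains in percent (of 100 MCQs), as reported in the
# abstract, listed as (round 1, round 2).
fa_scores = {
    "ChatGPT-3.5": (58, 62),
    "ChatGPT-4": (63, 77),
    "Bing AI": (81, 73),
}
ea_gains = {
    "ChatGPT-3.5": (7, 5),
    "ChatGPT-4": (5, 2),
    "Bing AI": (3, 1),
}


def combined_accuracy(fa, ea):
    """Assumed definition: overall accuracy = FA + EA gain (percentage points)."""
    return fa + ea


def error_rates(fa_by_model, ea_by_model):
    """Return {model: (round-1 error %, round-2 error %)} after adding EA to FA."""
    rates = {}
    for model, fa_rounds in fa_by_model.items():
        rates[model] = tuple(
            100 - combined_accuracy(fa, ea)
            for fa, ea in zip(fa_rounds, ea_by_model[model])
        )
    return rates


rates = error_rates(fa_scores, ea_gains)
all_rates = [rate for pair in rates.values() for rate in pair]

for model, pair in rates.items():
    print(f"{model}: error rates {pair[0]}% and {pair[1]}%")
print(f"Range: {min(all_rates)}% to {max(all_rates)}%")
```

Under these assumptions the lowest and highest error rates come out at 16% and 35%, which agrees with the range stated in the conclusion. The sketch does not reproduce the significance test behind p > 0.3, since the abstract does not name the test used.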


Journal: Urologia Internationalis
Publication Date: Mar 30, 2024
License type: CC BY 4.0

