Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.

Vanessa Brébant, Konstantin Frank, Michael Alfertshofer, Cosima C Hoch, Philipp Lamby, Leonard Knoedler, Bhagvat Maheta, Sebastian Cotofana, Paul F Funk, Lukas Prantl, Samuel Knoedler

doi:10.2196/51148

Abstract

The United States Medical Licensing Examination (USMLE) has been a critical part of medical education since 1992, testing various aspects of a medical student's knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT's performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. This paper aimed to analyze ChatGPT's performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions was entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A weak but statistically significant negative correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=-0.069; P=.003); no such correlation was found for ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=-0.289 for ChatGPT 3.5 and ρ=-0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 at all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally well, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics.
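
The counts reported in the abstract can be checked with a few lines of Python. The sketch below recomputes the question totals and the two accuracy rates, then runs a pooled two-proportion z-test on them. The choice of test is an assumption made here for illustration; the abstract does not state which test the authors used, and the helper names (`accuracy`, `two_proportion_z`) are hypothetical.

```python
import math


def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total


def two_proportion_z(c1: int, n1: int, c2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumption: this test is used only as an illustration and is not
    necessarily the method applied in the paper.
    """
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts taken from the abstract.
total_extracted = 2069
image_based = 229
text_based = total_extracted - image_based

gpt4 = (194, 229)      # (correct, asked) for ChatGPT 4
gpt35 = (1047, 1840)   # (correct, asked) for ChatGPT 3.5

z, p_value = two_proportion_z(*gpt4, *gpt35)

print(f"text-based questions: {text_based}")
print(f"ChatGPT 4 accuracy:   {accuracy(*gpt4):.1%}")
print(f"ChatGPT 3.5 accuracy: {accuracy(*gpt35):.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

Under these assumptions, the 1840 text-based questions follow from subtracting the 229 image-based questions from the 2069 extracted, and the gap between the two accuracy rates is large relative to its standard error, which is consistent with the reported statistical significance.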

Journal: JMIR Medical Education
Publication Date: Jan 5, 2024
Citations: 14
License type: CC BY

Similar Papers

COMLEX-USA and USMLE for Osteopathic Medical Students: Should We Duplicate, Divide, or Unify?
Harris Ahmed ... J Bryan Carmody
Journal of Graduate Medical Education | VOL. 14
01 Feb 2022

Undergraduate institutional MCAT scores as predictors of USMLE step 1 performance.
William T Basco ... Gregory E Gilbert
Academic Medicine: Journal of the Association of American Medical Colleges | VOL. 77
01 Oct 2002

Do USMLE steps, and ITE score predict the American Board of Internal Medicine Certifying Exam results?
Supratik Rayamajhi ... Shiva Shrotriya
BMC Medical Education | VOL. 20
18 Mar 2020

Reporting of USMLE Step 1 as Pass/Fail: A Benefit for Residency Programs and Those Underrepresented in Medicine?
Joshua M Romero ... Claudia I Martinez
Journal of Graduate Medical Education | VOL. 13
22 Jan 2021
