In-depth analysis of ChatGPT’s performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions

Leonard Knoedler, Samuel Knoedler, Cosima C Hoch, Lukas Prantl, Konstantin Frank, Laura Soiderer, Sebastian Cotofana, Amir H Dorafshar, Thilo Schenck, Felix Vollbach, Giuseppe Sofo, Michael Alfertshofer

doi:10.1038/s41598-024-63997-7

Abstract

ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. Despite intriguing preliminary findings in areas such as clinical management and patient education, a substantial knowledge gap remains in understanding the potential and limitations of ChatGPT’s capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After 352 image-based questions were excluded, the remaining 2,377 text-based questions were categorized and entered manually into ChatGPT, and its responses were recorded. ChatGPT’s overall performance was analyzed by question difficulty, category, and content, with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy of 55.8% across the n = 2,377 USMLE Step 1 preparation questions from the Amboss online question bank. Its performance correlated significantly and inversely with question difficulty (rs = -0.306; p < 0.001), and its accuracy remained comparable to that of the human user peer group across all levels of question difficulty. Notably, ChatGPT performed better on serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). It also performed significantly worse on pathophysiology-related question stems (signal phrase: “what is the most likely/probable cause”). ChatGPT performed consistently across various question categories and difficulty levels. These findings emphasize the need for further investigations into the potential and limitations of ChatGPT in medical examination and education.
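The abstract reports accuracy comparisons for questions containing particular signal words and a rank correlation between difficulty and performance, but does not describe how these were computed. The following is a minimal Python sketch of how such an analysis could be run, assuming each question is stored with its stem text, an ordinal difficulty level, and whether ChatGPT answered it correctly. The `Question` record, the substring matching for signal phrases, the Pearson chi-square test (without continuity correction) for the with/without-phrase comparison, and the per-question Spearman correlation are all assumptions for illustration; the authors' exact tests and tooling are not stated here.

```python
import math
from dataclasses import dataclass


@dataclass
class Question:
    stem: str         # question text as entered into ChatGPT
    difficulty: int   # hypothetical ordinal difficulty level (1 = easiest)
    correct: bool     # whether ChatGPT's answer was graded correct


def has_signal(q: Question, phrases: tuple[str, ...]) -> bool:
    """True if any signal phrase occurs in the stem (case-insensitive substring match)."""
    stem = q.stem.lower()
    return any(p.lower() in stem for p in phrases)


def accuracy(qs: list[Question]) -> float:
    """Share of correctly answered questions; NaN for an empty group."""
    return sum(q.correct for q in qs) / len(qs) if qs else float("nan")


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square without continuity correction for [[a, b], [c, d]].

    With 1 degree of freedom the upper-tail p-value is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column: no test possible
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def compare_signal(qs: list[Question], phrases: tuple[str, ...]) -> tuple[float, float, float]:
    """Accuracy with vs. without the signal phrase, plus the chi-square p-value."""
    hit = [q for q in qs if has_signal(q, phrases)]
    miss = [q for q in qs if not has_signal(q, phrases)]
    a = sum(q.correct for q in hit)    # correct answers among stems with the phrase
    c = sum(q.correct for q in miss)   # correct answers among all other stems
    _, p = chi_square_2x2(a, len(hit) - a, c, len(miss) - c)
    return accuracy(hit), accuracy(miss), p


def average_ranks(xs: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((u - mx) * (v - my) for u, v in zip(rx, ry))
    sx = math.sqrt(sum((u - mx) ** 2 for u in rx))
    sy = math.sqrt(sum((v - my) ** 2 for v in ry))
    return cov / (sx * sy) if sx and sy else float("nan")
```

For example, `compare_signal(questions, ("most likely cause", "most probable cause"))` would contrast pathophysiology-style stems with all other stems, and `spearman([q.difficulty for q in questions], [int(q.correct) for q in questions])` would return a negative value when harder questions are answered correctly less often. Substring matching is deliberately simple and would miss paraphrased stems, so the manual categorization described in the abstract is likely more reliable in practice.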

Journal: Scientific Reports
Publication Date: Jun 12, 2024
Citations: 2
License type: CC BY 4.0

Similar Papers

Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.
Vanessa Brébant ... Samuel Knoedler
JMIR Medical Education | VOL. 10
05 Jan 2024

Comparing the performance of artificial intelligence learning models to medical students in solving histology and embryology multiple choice questions
Miloš Bajčetić ... Ivan Zaletel
Annals of Anatomy - Anatomischer Anzeiger | VOL. 254
21 Mar 2024

A genetic fuzzy expert system for automatic question classification in a competitive learning environment
Elena Verdú ... Ricardo García
Expert Systems With Applications | VOL. 39
25 Jan 2012

Comparison of Gemini Advanced and ChatGPT 4.0's Performances on the Ophthalmology Resident Ophthalmic Knowledge Assessment Program (OKAP) Examination Review Question Banks.
Gurnoor S Gill ... Jillene Moxam
Cureus | VOL. 16
17 Sep 2024
