Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment.

Victoria Yaneva, Peter Baldwin, Daniel P Jurich, Kimberly Swygert, Brian E Clauser

doi:10.1097/acm.0000000000005549

Open Access

Abstract

In late 2022 and early 2023, reports that ChatGPT could pass the United States Medical Licensing Examination (USMLE) generated considerable excitement, and media response suggested ChatGPT has credible medical knowledge. This report analyzes the extent to which an artificial intelligence (AI) agent's performance on USMLE sample items can generalize to performance on an actual USMLE examination, using ChatGPT as an illustration. As with earlier investigations, analyses were based on publicly available USMLE sample items. Each item was submitted to ChatGPT (version 3.5) 3 times to evaluate stability. Responses were scored following rules that match operational practice, and a preliminary analysis explored the characteristics of items that ChatGPT answered correctly. The study was conducted between February and March 2023. For the full sample of items, ChatGPT scored above 60% correct except for one replication for Step 3. Response success varied across replications for 76 items (20%). There was a modest correspondence with item difficulty, wherein ChatGPT was more likely to respond correctly to items found easier by examinees. ChatGPT performed significantly worse (P < .001) on items relating to practice-based learning. Achieving 60% accuracy is only an approximate indicator of meeting the passing standard, and a proper comparison requires statistical adjustments. Hence, this assessment can only suggest consistency with the passing standards for Steps 1 and 2 Clinical Knowledge, with further limitations in extrapolating this inference to Step 3. These limitations stem from variation in item difficulty and the exclusion of the simulation component of Step 3 from the evaluation; they would apply to any AI system evaluated on the Step 3 sample items. It is crucial to note that responses from large language models vary notably when the same inquiry is repeated, underscoring the need for expert validation to ensure their utility as a learning tool.
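The two summary quantities reported in the abstract, percent correct per replication and the number of items whose correctness changed across replications, can be illustrated with a minimal sketch. This is not the authors' scoring code: the function names, the dictionary layout, and the five toy items below are hypothetical and chosen only to show the arithmetic.

```python
from typing import Dict, List


def replication_accuracy(scores: Dict[str, List[bool]]) -> List[float]:
    """Return the fraction of items scored correct in each replication.

    `scores` maps a (hypothetical) item id to one boolean per replication.
    """
    n_items = len(scores)
    n_reps = len(next(iter(scores.values())))
    return [
        sum(reps[r] for reps in scores.values()) / n_items
        for r in range(n_reps)
    ]


def unstable_items(scores: Dict[str, List[bool]]) -> List[str]:
    """Return ids of items whose correctness differed across replications."""
    return [item_id for item_id, reps in scores.items() if len(set(reps)) > 1]


# Toy data only (assumption): five made-up items, three replications each.
toy_scores = {
    "item_1": [True, True, True],
    "item_2": [True, False, True],
    "item_3": [False, False, False],
    "item_4": [True, True, False],
    "item_5": [True, True, True],
}

per_replication = replication_accuracy(toy_scores)
changed = unstable_items(toy_scores)
share_changed = len(changed) / len(toy_scores)
```

With real data, `per_replication` would be compared against the approximate 60% threshold mentioned above, and `share_changed` corresponds to the kind of figure reported as 76 items (20%).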

Journal: Academic medicine : journal of the Association of American Medical Colleges
Publication Date: Nov 7, 2023
Citations: 11
License type: CC BY-NC-ND 4.0

Similar Papers

Correlation Between United States Medical Licensing Examination and Comprehensive Osteopathic Medical Licensing Examination Scores for Applicants to a Dually Approved Emergency Medicine Residency
Kathleen E Kane ... Bryan G Kane
Journal of Emergency Medicine | VOL. 52
15 Nov 2016

COMLEX-USA and USMLE for Osteopathic Medical Students: Should We Duplicate, Divide, or Unify?
Harris Ahmed ... J Bryan Carmody
Journal of Graduate Medical Education | VOL. 14
01 Feb 2022

Skin of color representation in medical education: An analysis of National Board of Medical Examiners' self-assessments and popular question banks
Abigail L Meckley ... Robert P Dellavalle
Journal of the American Academy of Dermatology | VOL. 86
04 Oct 2021

The Role of USMLE Scores in Selecting Residents
Gerard F Dillon ... Brian E Clauser
Academic Medicine | VOL. 86
01 Jul 2011