Special Issue on Informatics Education: ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.

Tessa Louise Danehy, Jessica Hecht, Sabrina Kentis, Clyde Schechter, Sunit Jariwala

doi:10.1055/a-2405-0138

Abstract

The main objective of this study is to evaluate the ability of the large language model ChatGPT to accurately answer USMLE board-style medical ethics questions compared to medical knowledge-based questions. The additional objectives are to compare the overall accuracy of GPT-3.5 with that of GPT-4 and to assess the variability of the responses given by each version.

Using AMBOSS, a third-party USMLE Step exam test-prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials asking these questions on GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.

Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 performed 18 percentage points (P < 0.05) worse on medical ethics questions than on medical knowledge questions, and GPT-3.5 performed 7 percentage points (P = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22 percentage points (P < 0.001) on medical ethics and 33 percentage points (P < 0.001) on medical knowledge. GPT-4 also exhibited a lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55), which indicates lower variability in its responses.

Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.

Keywords: ChatGPT, Large Language Model, Artificial Intelligence, Medical Education, USMLE, Ethics.
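The Shannon entropy calculation used to evaluate response variation can be illustrated with a minimal sketch. The abstract does not state the logarithm base, any normalization, or how per-question values were aggregated, so the base-2 logarithm, the simple mean across questions, and the function names `shannon_entropy` and `mean_entropy` below are all assumptions made for illustration, not the authors' procedure. The idea is that a question answered identically in all 30 trials has entropy 0, while answers spread across several choices give a higher value.

```python
import math
from collections import Counter


def shannon_entropy(responses, base=2.0):
    """Shannon entropy of the answer choices given across repeated trials
    of one question. The base-2 default is an assumption; the abstract
    does not specify the base."""
    total = len(responses)
    if total == 0:
        raise ValueError("responses must not be empty")
    entropy = 0.0
    for count in Counter(responses).values():
        p = count / total  # observed share of trials giving this answer
        entropy -= p * (math.log(p) / math.log(base))
    return entropy


def mean_entropy(trials_by_question, base=2.0):
    """Average per-question entropy over a set of questions.
    Averaging is a hypothetical aggregation choice for this sketch."""
    values = [shannon_entropy(r, base) for r in trials_by_question.values()]
    return sum(values) / len(values)


# Made-up illustrative data: 30 trials per question, not study results.
example = {
    "q1": ["A"] * 30,               # same answer every time
    "q2": ["A"] * 15 + ["B"] * 15,  # answers split evenly between two choices
}
print(shannon_entropy(example["q1"]))
print(shannon_entropy(example["q2"]))
print(mean_entropy(example))
```

Under these assumptions, a lower mean value for a model corresponds to more consistent answer choices across repeated trials, which is the sense in which the abstract reports GPT-4 as less variable than GPT-3.5.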
