Abstract

Chat Generative Pre-Trained Transformer (ChatGPT) is an artificial intelligence (AI) chatbot capable of delivering human-like responses to a seemingly infinite number of inquiries. For the technology to perform certain healthcare-related tasks or act as a study aid, it should have up-to-date knowledge and the ability to reason through medical information. The purpose of this study was to assess the orthopedic knowledge and reasoning ability of ChatGPT by querying it with orthopedic board-style questions. We queried ChatGPT (GPT-3.5) with a total of 472 questions from the Orthobullets dataset (n = 239), the 2022 Orthopaedic In-Training Examination (OITE) (n = 124), and the 2021 OITE (n = 109). The importance, difficulty, and category were recorded for questions from the Orthobullets question bank. Responses were assessed for answer choice correctness, whether the explanation given matched that of the dataset, answer integrity, and reason for incorrectness. ChatGPT correctly answered 55.9% (264/472) of questions and, of those answered correctly, gave an explanation that matched that of the dataset for 92.8% (245/264) of the questions. The chatbot used information internal to the question in all responses (100%) and used information external to the question (98.3%) as well as logical reasoning (96.4%) in most responses. There was no significant difference in the proportion of questions answered correctly between the datasets (P = 0.62). There was no significant difference in the proportion of questions answered correctly by question category (P = 0.67), importance (P = 0.95), or difficulty (P = 0.87) within the Orthobullets dataset questions. ChatGPT most often answered questions incorrectly due to information error (i.e., failure to identify the information required to answer the question), which accounted for 81.7% of incorrect responses. ChatGPT performs below the threshold likely needed to pass the American Board of Orthopedic Surgery (ABOS) Part I written exam.
The chatbot's performance on the 2022 and 2021 OITEs fell between the average performance of an intern and that of a second-year resident. A major limitation of the current model is its failure to identify the information required to correctly answer the questions.
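The headline accuracy figures above follow directly from the reported counts; as a minimal sketch (using only the counts stated in this abstract), the overall and explanation-match percentages can be reproduced as:

```python
# Counts reported in the abstract
total_questions = 472      # Orthobullets (239) + 2022 OITE (124) + 2021 OITE (109)
correct = 264              # questions ChatGPT answered correctly
matching_explanations = 245  # correct answers whose explanation matched the dataset

# Overall accuracy: 264/472
overall_accuracy = round(correct / total_questions * 100, 1)

# Explanation-match rate among correct answers: 245/264
explanation_match_rate = round(matching_explanations / correct * 100, 1)

print(overall_accuracy)        # 55.9
print(explanation_match_rate)  # 92.8
```

The dataset sizes (239 + 124 + 109) sum to the 472 total queries, confirming internal consistency of the reported figures.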
