Background
The performance of large language models, such as ChatGPT, on medical and subspecialty examinations has been preliminarily explored in fields such as radiology, obstetrics and gynecology, and orthopedic surgery. However, no literature assessing ChatGPT's ability to answer hand surgery examination questions exists. The purpose of this study was to evaluate ChatGPT's performance on hand surgery board-style examination questions.

Methods
All questions from the American Society for Surgery of the Hand (ASSH) Hand 100 Exam, Beginner Assessment, and Intermediate Assessment tools were entered into ChatGPT-3.5. Responses were regenerated twice to identify inconsistencies. Duplicate questions, questions with figures and/or videos, and questions that ChatGPT declined to answer were excluded. ChatGPT's correct response rate, answer modifications, and human accuracy were recorded.

Results
A total of 117 questions from the 3 assessment tools were analyzed: 49 from the ASSH Hand 100, 32 from the Beginner, and 36 from the Intermediate Assessment tools. On ChatGPT's initial attempt, 40.82% (20/49), 50.0% (16/32), and 38.89% (14/36) of questions were answered correctly, respectively. Overall, ChatGPT correctly answered 50 of 117 questions (42.7%) on the first try. ChatGPT excelled (>60% correct) on the topics of mass/tumor, nerve, and wrist, and performed poorly (<40% correct) on anatomy/basic science/imaging, brachial plexus, congenital, elbow, tendon, trauma, and vascular disorders. On the Beginner and Intermediate Exams, humans correctly answered 56.64% and 62.73% of questions, respectively.

Conclusions
ChatGPT can correctly answer simpler hand surgery questions but performed poorly compared with humans on higher-difficulty questions.