Background
The rapid improvement of generative artificial intelligence (AI) models in medical domains, including answering board-style questions, warrants further investigation of their utility and accuracy on orthopaedic surgery written board questions. Previous studies have analyzed the performance of ChatGPT alone on board exams, but a head-to-head comparison of multiple current AI models has yet to be performed. The objective of this study was therefore to compare the utility and accuracy of various large language models (LLMs) in answering Orthopaedic Surgery In-Training Exam (OITE) written board questions, both against each other and against orthopaedic surgery residents.

Methods
A complete set of questions from the 2022 OITE was entered into each LLM, and results were calculated and compared against the national performance of orthopaedic surgery residents. Results were analyzed by overall performance and by question type: Type A questions test knowledge and recall of facts; Type B questions involve diagnosis and analysis of information; and Type C questions focus on the evaluation and management of disease, requiring both knowledge and reasoning to develop treatment plans.

Results
Google Gemini was the most accurate model, answering 69.9% of questions correctly. Google Gemini also outperformed ChatGPT and Claude on Type A (76.9%) and Type C (67.4%) questions, while Claude performed best on Type B questions (70.7%). Questions without images were answered more accurately than those with images (65.9% vs. 34.1%). All LLMs performed above the average of a first-year orthopaedic surgery intern, with Google Gemini and Claude approaching the performance of fourth- and fifth-year orthopaedic surgery residents.

Conclusion
This study assessed LLMs (Google Gemini, ChatGPT, and Claude) against orthopaedic surgery residents on the OITE. These LLMs performed on par with orthopaedic surgery residents: Google Gemini achieved the best performance overall and on Type A and C questions, while Claude performed best on Type B questions. LLMs have the potential to generate formative feedback and interactive case studies for orthopaedic trainees.