Abstract

This study aimed to determine the accuracy of large language models (LLMs) in answering oral and maxillofacial surgery (OMS) multiple-choice questions. A total of 259 questions from the university's question bank were answered by the LLMs (GPT-3.5, GPT-4, Llama 2, Gemini, and Copilot). The scores per category, as well as the total score out of 259, were recorded and evaluated, with the passing score set at 50%. The mean overall score among all LLMs was 62.5%. GPT-4 performed the best (76.8%, 95% confidence interval (CI) 71.4–82.2%), followed by Copilot (72.6%, 95% CI 67.2–78.0%), GPT-3.5 (62.2%, 95% CI 56.4–68.0%), Gemini (58.7%, 95% CI 52.9–64.5%), and Llama 2 (42.5%, 95% CI 37.1–48.6%). There was a statistically significant difference between the scores of the five LLMs overall (χ² = 79.9, df = 4, P < 0.001) and within all categories except 'basic sciences' (P = 0.129), 'dentoalveolar and implant surgery' (P = 0.052), and 'oral medicine/pathology/radiology' (P = 0.801). The LLMs performed best in 'basic sciences' (68.9%) and poorest in 'pharmacology' (45.9%). The LLMs can be used as adjuncts in teaching, but should not be used for clinical decision-making until the models are further developed and validated.
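The overall chi-square result reported above can be checked directly from the percentages. A minimal sketch, assuming the per-model correct counts are recovered by rounding each model's percentage of 259 questions (199, 188, 161, 152, and 110 correct for GPT-4, Copilot, GPT-3.5, Gemini, and Llama 2, respectively); these counts are an inference, since the abstract does not state the raw figures:

```python
# Sketch: reproduce the overall Pearson chi-square test from the abstract.
# Correct counts per model are inferred by rounding each score out of 259
# (an assumption; the raw counts are not given in the abstract).
TOTAL = 259
correct = {"GPT-4": 199, "Copilot": 188, "GPT-3.5": 161,
           "Gemini": 152, "Llama 2": 110}

# Build the 5x2 contingency table: (correct, incorrect) per model.
table = [(c, TOTAL - c) for c in correct.values()]

row_totals = [sum(row) for row in table]        # 259 answers per model
col_totals = [sum(col) for col in zip(*table)]  # 810 correct, 485 incorrect
grand = sum(row_totals)                         # 1295 answers overall

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * column_total / grand_total.
chi2 = sum((obs - row_totals[i] * col_totals[j] / grand) ** 2
           / (row_totals[i] * col_totals[j] / grand)
           for i, row in enumerate(table)
           for j, obs in enumerate(row))

df = (len(table) - 1) * (2 - 1)  # (rows - 1) * (cols - 1)

print(round(chi2, 1), df)                      # 79.9 4, matching the abstract
print(round(100 * col_totals[0] / grand, 1))   # 62.5, the mean overall score
```

That the inferred counts reproduce both the reported χ² of 79.9 (df = 4) and the 62.5% mean overall score suggests the rounding assumption is sound.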
