Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential

Reema Mahmoud, Amir Shuster, Shlomi Kleinman, Shimrit Arbel, Clariel Ianculovici, Oren Peleg

doi:10.1016/j.joms.2024.11.007

Abstract

Background: While artificial intelligence (AI) has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored.

Purpose: This study aimed to measure and compare the accuracy of four leading LLMs on OMS board examination questions and to identify specific areas for improvement.

Study design, setting, and sample: An in-silico cross-sectional study was conducted to evaluate four AI chatbots on 714 OMS board examination questions.

Predictor variable: The predictor variable was the LLM used: LLM 1 (Generative Pre-trained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pre-trained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA).

Outcome variables: The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs’ ability to correct errors on subsequent attempts and their performance across 11 specific OMS subject domains: Medicine and Anesthesia, Dentoalveolar and Implant Surgery, Maxillofacial Trauma, Maxillofacial Infections, Maxillofacial Pathology, Salivary Glands, Oncology, Maxillofacial Reconstruction, Temporomandibular Joint Anatomy and Pathology, Craniofacial and Clefts, and Orthognathic Surgery.

Covariates: No additional covariates were considered.

Analyses: Statistical analysis included one-way ANOVA and post-hoc Tukey HSD to compare performance across chatbots. Chi-square tests were used to assess response consistency and error correction, with statistical significance set at p < 0.05.

Results: LLM 1 achieved the highest accuracy with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, p = 0.002), LLM 2 (64.83%, p = 0.001), and LLM 4 (62.18%, p < 0.001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of errors, while LLM 2 corrected 93.44%, both statistically significantly higher than LLM 4 (29.26%) and LLM 3 (70.71%) (p < 0.001).

Conclusions and relevance: LLM 1 (GPT-4o) significantly outperformed other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across different domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field.
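The outcome definition and the chi-square comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the abstract does not name the software used, the function names `accuracy_pct` and `chi_square_2x2` are hypothetical, the choice of a Pearson chi-square without continuity correction on a 2x2 table is an assumption, and the counts in the usage lines are invented for illustration and are not study data.

```python
from math import erfc, sqrt


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as defined in the abstract: percentage of correct answers."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table

        [[a, b],
         [c, d]]

    e.g. rows = two chatbots, columns = errors corrected / not corrected.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


# Illustrative counts only (hypothetical, not taken from the study):
# model A corrected 45 of 50 initial errors, model B corrected 15 of 50.
stat, p_value = chi_square_2x2(45, 5, 15, 35)
print(f"accuracy A: {accuracy_pct(45, 50):.1f}%")
print(f"chi-square = {stat:.2f}, significant at 0.05: {p_value < 0.05}")
```

Comparing more than two chatbots at once, as the study's ANOVA and Tukey HSD steps do, would need a statistics library; the sketch above covers only the pairwise chi-square case.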
