Abstract

Artificial intelligence (AI) is a field of computer science that seeks to build intelligent machines capable of carrying out tasks that usually require human intelligence. AI may assist dentists with a variety of dental tasks, including clinical diagnosis and treatment planning. This study aims to compare the performance of AI and oral medicine residents in diagnosing different cases and providing treatment, and to determine whether AI is reliable enough to assist them in their field of work. The study conducted a comparative analysis of the responses of third- and fourth-year residents trained in Oral Medicine and Pathology at King Saud University, College of Dentistry. The residents were given a closed multiple-choice test consisting of 19 questions with four response options labeled A-D and one question with five response options labeled A-E. The test was administered via Google Forms, and each resident's responses were stored electronically in an Excel sheet (Microsoft® Corp., Redmond, WA). The residents' answers were then compared to the responses generated by three large language models (LLMs): OpenAI, Stablediffusion, and PopAI. The questions were input into the language models in the same format as the original test, and a new chat session was created before each question to eliminate memory-retention bias. The input was done on November 19, 2023, the same day the official multiple-choice test was administered. The sample comprised 20 residents trained in Oral Medicine and Pathology at King Saud University, College of Dentistry, including both third-year and fourth-year residents. The analysis covered the responses of the three LLMs (OpenAI, Stablediffusion, and PopAI) and the responses of the 20 senior residents for 20 clinical cases concerning oral lesion diagnosis. Significant variations between the groups were observed in the responses to only two questions (10%).
For the remaining questions, there were no significant differences. The median (IQR) score of the LLMs was 50.0 (45.0 to 60.0), with a minimum of 40 (for Stablediffusion) and a maximum of 70 (for OpenAI). The median (IQR) score of the senior residents was 65.0 (55.0-75.0). The highest and lowest scores of residents were 90 and 40, respectively. There was no significant difference in the percent scores of residents and LLMs (p = 0.211). The agreement level was measured using the Kappa value. The agreement among senior dental residents was weak, with a Kappa value of 0.396. In contrast, the agreement among the LLMs was moderate, with a Kappa value of 0.622, suggesting a more cohesive alignment in responses among the artificial intelligence models. When comparing residents' responses with those generated by the individual LLMs (OpenAI, Stablediffusion, and PopAI), the agreement levels were consistently weak, with Kappa values of 0.402, 0.381, and 0.392, respectively. The current study reveals no significant difference in response scores, whereas the agreement analysis shows that agreement among the residents was low compared with the higher agreement among the LLMs. Dentists should consider that AI can be beneficial in providing diagnosis and treatment and use it to assist them.
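The Kappa agreement statistic reported above (e.g., 0.396 among residents, 0.622 among LLMs) can be illustrated with a minimal pairwise Cohen's kappa computation. This is a hedged sketch: the function name and the example answer lists are hypothetical, and the study may have used a different kappa variant (e.g., Fleiss' kappa for more than two raters) or statistical software not specified in the abstract.

```python
# Minimal Cohen's kappa for two raters answering the same multiple-choice
# questions. Hypothetical illustration only; not the study's actual procedure.

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length answer lists."""
    n = len(rater_a)
    # Observed proportion of identical answers.
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement from each rater's marginal answer frequencies.
    categories = set(rater_a) | set(rater_b)
    p_expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Two raters answering four questions (options A-D):
print(cohen_kappa(['A', 'A', 'B', 'B'], ['A', 'B', 'B', 'B']))  # 0.5
```

Here the raters agree on 3 of 4 answers (75% raw agreement), but chance agreement from their answer distributions is 50%, so kappa corrects the score down to 0.5, a moderate level on common interpretation scales.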
