Background: Orofacial pain (OFP) encompasses a complex array of conditions affecting the face, mouth, and jaws, often leading to significant diagnostic challenges and high rates of misdiagnosis. Artificial intelligence, particularly large language models such as GPT-4 (OpenAI, San Francisco, CA, USA), offers potential as a diagnostic aid in healthcare settings.
Objective: To evaluate the diagnostic accuracy of GPT-4 in OFP cases as a clinical decision support system (CDSS) and to compare its performance against treating clinicians, expert evaluators, medical students, and general practitioners.
Methods: A total of 100 anonymized patient case descriptions covering diverse OFP conditions were collected. GPT-4 was prompted to generate a primary diagnosis and differential diagnoses for each case according to the International Classification of Orofacial Pain (ICOP) criteria. Its diagnoses were compared with gold-standard diagnoses established by the treating clinicians, and a scoring system was used to assess accuracy at three hierarchical ICOP levels. A subset of 24 cases was also evaluated by two clinical experts, two final-year medical students, and two general practitioners for comparative analysis. Diagnostic performance and interrater reliability were calculated.
Results: GPT-4 achieved the highest accuracy level (ICOP level 3) in 38% of cases, with an overall diagnostic performance score of 157 out of 300 points (52%). The model provided accurate differential diagnoses in 80% of cases (400 out of 500 points). In the subset of 24 cases, its performance was comparable to that of the non-expert human evaluators but was surpassed by the clinical experts, who correctly diagnosed 54% of cases at level 3. GPT-4 demonstrated high accuracy in specific categories, correctly diagnosing 81% of trigeminal neuralgia cases at level 3. Interrater reliability between GPT-4 and the human evaluators was low (κ = 0.219, p < 0.001), indicating variability in diagnostic agreement.
Conclusions: GPT-4 shows promise as a CDSS for OFP by improving diagnostic accuracy and offering structured differential diagnoses. While it does not yet outperform expert clinicians, GPT-4 can augment diagnostic workflows, particularly in primary care and educational settings. Effective integration into clinical practice requires adherence to rigorous guidelines, thorough validation, and ongoing professional oversight to ensure patient safety and diagnostic reliability.