Introduction: This study evaluates the performance of three versions of the ChatGPT large language model (ChatGPT-3.5, ChatGPT-4, and ChatGPT-Omni) in answering questions on the diagnosis and treatment of gynecological cancers, including ovarian, endometrial, and cervical cancers.

Methods: A total of 804 questions were distributed equally across four categories (true/false, multiple-choice, open-ended, and case-scenario), with the questions in each category stratified into three difficulty levels (easy, medium, and difficult). Accuracy was scored directly for the true/false and multiple-choice questions, while open-ended and case-scenario responses were rated on a six-point Likert scale for accuracy, completeness, and alignment with established clinical guidelines.

Results: On true/false questions, ChatGPT-Omni achieved accuracy rates of 100% (easy), 98% (medium), and 97% (difficult), exceeding ChatGPT-4 (94%, 90%, 85%) and ChatGPT-3.5 (90%, 85%, 80%) (p = 0.041, 0.023, and 0.014, respectively). On multiple-choice questions, ChatGPT-Omni again led with 100% (easy), 98% (medium), and 93% (difficult), compared with ChatGPT-4 (92%, 88%, 80%) and ChatGPT-3.5 (85%, 80%, 70%) (p = 0.035, 0.028, 0.011). On open-ended questions, ChatGPT-Omni earned mean Likert scores of 5.8 (easy), 5.5 (medium), and 5.2 (difficult), outperforming ChatGPT-4 (5.4, 5.0, 4.5) and ChatGPT-3.5 (5.0, 4.5, 4.0) (p = 0.037, 0.026, 0.015). The same pattern held for case-scenario questions, where ChatGPT-Omni scored 5.6, 5.3, and 4.9 at the easy, medium, and difficult levels, respectively (p = 0.017, 0.008, 0.012).

Conclusions: ChatGPT-Omni outperformed the earlier models on clinical questions related to gynecological cancers, underscoring its potential utility as a decision-support tool and an educational resource in clinical practice.
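The abstract reports per-tier p-values for the between-model comparisons but does not name the statistical test used. As a minimal illustrative sketch only, the Python snippet below shows how correct/incorrect counts for the three models on a single difficulty tier could be compared with a chi-square test of independence (scipy.stats.chi2_contingency); the counts are hypothetical placeholders and are not taken from the study.

    # Hedged sketch: the study does not specify its test; a chi-square test of
    # independence is one plausible way to compare accuracy across three models.
    from scipy.stats import chi2_contingency

    # Rows are models; columns are [correct, incorrect] counts on a single
    # difficulty tier. All counts below are hypothetical placeholders.
    observed = [
        [67, 0],  # hypothetical model A: 67/67 correct
        [63, 4],  # hypothetical model B
        [60, 7],  # hypothetical model C
    ]

    chi2, p_value, dof, _ = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")

For the Likert-rated open-ended and case-scenario responses, a rank-based test such as Kruskal-Wallis (scipy.stats.kruskal) would be an analogous choice; again, this is an assumption rather than the authors' stated method.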