Abstract
e13585 Background: The integration of Large Language Models (LLMs) into healthcare and medical education represents a significant paradigm shift, with the potential to transform how medical knowledge is accessed and assimilated. These models, however, have not yet been systematically trained, tested, or validated on complex medical information such as sub-specialty medical examinations. This study explores the performance of seven major LLMs in clinical radiation oncology using residency in-training exams. Methods: The 2021 American College of Radiology (ACR) Radiation Oncology In-Training Exam (TXIT) was used to evaluate seven LLMs: OpenAI's GPT-3.5-turbo, GPT-4, and GPT-4-turbo; three of Meta's Llama-2 models (7 billion, 13 billion, and 70 billion parameters); and Google's PaLM-2-text-bison. The ACR provided the publicly available national scoring for this exam. The exam comprised 298 questions across 13 domains, including clinical radiation oncology (195 questions, 65.4%). The exam was processed through each LLM via an application programming interface. LLM-generated answers were analyzed by clinical disease site and compared with the performance of radiation oncology trainees, stratified by Post-Graduate Year (PGY) 2-5. Results: LLM performance in the overall clinical radiation oncology domain varied. OpenAI's GPT-4-turbo performed best with 68.0% correct answers, followed by GPT-4 (61.0%), GPT-3.5-turbo (48.0%), PaLM-2-text-bison (40.0%), and the three Llama-2 models (70b 37.0%, 13b 38%, 7b 26%). GPT-4-turbo outperformed lower-level trainees (PGY2 51.6%, PGY3 61.6%) and performed comparably to upper-level radiation oncology trainees (PGY4 64.1%, PGY5 68.3%). Notably, GPT-4-turbo showed a 7.0 percentage point improvement over its predecessor GPT-4.
LLMs scored lowest in the gastrointestinal, genitourinary, and gynecology domains and highest in the bone and soft tissue; central nervous system and eye; and head, neck, and skin domains. Conclusions: GPT-4-turbo demonstrates clinical accuracy comparable to that of upper-level trainees and superior to that of lower-level trainees in nearly all clinical domains. Conversely, the Llama-2 foundation models perform worse overall than Level 1 (PGY2) trainees. Score discrepancies across disease-site domains may be due to data availability, complexity of medical conditions, quality and quantity of training datasets, and interdisciplinary data inputs. Future research will need to evaluate the performance of models fine-tuned on clinical oncology. This study also underscores the need for rigorous validation of LLM-generated information against established medical literature and expert consensus, and for expert oversight of its application in medical education and practice.