Abstract

BACKGROUND Accurate differentiation between radiation necrosis and tumor progression or recurrence in patients treated with stereotactic radiosurgery for brain metastases is critical for guiding clinical management. This study uses the natural language processing capabilities of Meta Llama3, an artificial intelligence (AI) large language model (LLM), combined with prompt engineering, to rapidly categorize brain magnetic resonance imaging (MRI) radiology reports. Our objective was to develop an automated scoring system that classifies each report as raising concern for radiation necrosis, raising concern for tumor progression, equivocal, or stable.

METHODS Using a comprehensive dataset of reports annotated by expert radiologists, we ran inference on a 70-billion parameter Llama3 model (temperature 0.2, top_p 0.9) with prompts designed to capture the nuanced language and diagnostic criteria related to radiation necrosis or tumor progression.

RESULTS The first pass, performed on a training dataset of 107 reports, used prompts that did not predefine the clinical conditions. It achieved 43.4% accuracy in scoring when compared to a human reader’s classification of each report. Agreement between the human reader and Llama3 was assessed using the Gwet agreement coefficient, AC1=0.411 (99% CI 0.396-0.426). Multiple iterations of targeted prompt engineering were then used to narrow the definitions of radiation necrosis and tumor progression with specific examples and nuanced language, which raised accuracy to 72.0%, AC1=0.719 (99% CI 0.717-0.722).

DISCUSSION/FUTURE DIRECTIONS This agreement exceeds the lower human-LLM agreement recently reported for radiographic score assignment, and there is room for further calibration on a dataset of several thousand reports. Next, we will correlate automated interpretations with actual clinical management decisions for radiation necrosis (e.g., initiation of steroids, bevacizumab, Laser Interstitial Thermal Therapy, and/or repeat imaging). This automated scoring system illustrates the potential of LLMs in clinical applications. Future work will focus on integrating this model into clinical workflows and expanding its capabilities to include longitudinal monitoring of patient outcomes.
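
The abstract reports agreement as Gwet's AC1 but does not say how it was computed. As a minimal sketch, assuming the standard two-rater form of the coefficient, AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e = (1 / (K - 1)) * sum over categories of pi_k * (1 - pi_k), with pi_k the average proportion of reports the two raters assign to category k. The category names, the function name `gwet_ac1`, and the four example labels below are hypothetical and are not taken from the study; the confidence interval calculation is not covered.

```python
from collections import Counter

# Hypothetical label set mirroring the four classes named in the abstract.
CATEGORIES = ["radiation necrosis", "tumor progression", "equivocal", "stable"]


def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters who each label the same items once."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    k = len(categories)

    # Observed agreement: fraction of items with identical labels.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from the average marginal proportion of each category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = 0.0
    for category in categories:
        pi_k = (counts_a[category] + counts_b[category]) / (2 * n)
        p_e += pi_k * (1 - pi_k)
    p_e /= k - 1

    return (p_a - p_e) / (1 - p_e)


# Toy example (made-up labels, not study data).
human = ["stable", "stable", "tumor progression", "radiation necrosis"]
llm = ["stable", "equivocal", "tumor progression", "radiation necrosis"]
print(round(gwet_ac1(human, llm, CATEGORIES), 3))  # 0.671
```

In this toy case the raters agree on three of four reports (p_a = 0.75), and the chance term is 23/96, giving AC1 = 49/73, about 0.671. AC1 is often chosen over Cohen's kappa when category prevalence is skewed, which is plausible for report classes such as these.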