Abstract

Importance: Recent breakthroughs in large language models (LLMs) have shown impressive reasoning capabilities. Inferring disease trajectory from radiology reports is a central task in clinical oncology, and radiographic interpretation of disease state and treatment benefit in pancreatic cancer can be particularly challenging. We assessed the performance of LLMs (GPT-3.5-turbo and GPT-4) in inferring disease status, response to treatment, and location of disease for patients with pancreatic adenocarcinoma from clinical radiology reports. We also assessed the ability of LLMs to extrapolate from objective findings, versus their reliance on the radiologist’s summary.

Experimental Design: We used 200 deidentified radiology reports from pancreatic cancer patients at different stages of treatment and submitted these reports to the GPT-3.5 and GPT-4 models via a HIPAA-compliant Microsoft Azure instance. We assessed the reasoning capabilities of these two models in three independent tasks in pancreatic cancer imaging: 1) disease status/response to treatment, 2) disease location, and 3) presence of indeterminate nodules requiring further follow-up. Model responses were compared with a medical oncologist’s interpretation of each report. A second medical oncologist adjudicated discrepancies between the human and GPT classifications and assessed rates of data fabrication. We also assessed the importance of the radiologist’s summary versus the objective findings section, as well as different prompt engineering approaches.

Results: GPT-3.5 and GPT-4 reached maximum accuracies of 71% and 85%, respectively, when asked to classify radiology reports into seven categories of malignancy presence and trajectory, as compared with a medical oncologist’s interpretation. This accuracy depended strongly on asking the model to demonstrate its reasoning process before providing an answer, although this dependency was abrogated by providing the radiologist’s impression section.
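The comparison against the oncologist's labels described above can be sketched as follows. This is a minimal illustration only, not the authors' code: the function name, report IDs, and category labels (`category_a` and so on) are hypothetical placeholders, and the toy data is made up purely to show the calculation.

```python
from typing import Dict, List, Tuple


def score_against_reference(
    model_labels: Dict[str, str],
    reference_labels: Dict[str, str],
) -> Tuple[float, List[str]]:
    """Return (accuracy, report IDs whose labels disagree).

    Only reports present in both mappings are scored. The disagreeing
    IDs are the ones a second reviewer would be asked to adjudicate.
    """
    shared = sorted(set(model_labels) & set(reference_labels))
    if not shared:
        raise ValueError("no reports in common between model and reference")
    discrepancies = [
        rid for rid in shared if model_labels[rid] != reference_labels[rid]
    ]
    accuracy = (len(shared) - len(discrepancies)) / len(shared)
    return accuracy, discrepancies


# Hypothetical toy labels (NOT data from the study).
reference = {"r1": "category_a", "r2": "category_b",
             "r3": "category_c", "r4": "category_a"}
model = {"r1": "category_a", "r2": "category_b",
         "r3": "category_a", "r4": "category_a"}

accuracy, to_adjudicate = score_against_reference(model, reference)
print(f"accuracy={accuracy:.2f}, needs adjudication: {to_adjudicate}")
```

Returning the list of disagreeing report IDs alongside the accuracy keeps the adjudication step explicit, since those are the cases a second reader would review.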
We also found that the main source of misclassification was incongruence between the model’s reasoning and its final classification (as opposed to incorrect medical reasoning). Finally, we showed that GPT-4 performed strongly at identifying specific organ sites of disease, as well as the presence of indeterminate findings requiring further follow-up.

Conclusion: GPT models accurately interpret multiple clinically relevant features in radiology reports for patients with pancreatic adenocarcinoma. Accuracy improved markedly with GPT-4 compared with GPT-3.5. This suggests that GPT-based tools are potentially valuable clinical allies in interpreting radiologist-produced text and could be valuable aids in cohort identification for large-scale real-world data research and, eventually, clinical management support.

Citation Format: Travis Ian Zack, Madhumita Sushil, Brenda Miao, Arda Demirci, Corynn Kasap, Margaret Tempero, Atul Butte, Eric Collisson. Clinical inference of location and trajectory of pancreatic cancer from radiology reports using zero-shot LLM [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Pancreatic Cancer; 2023 Sep 27-30; Boston, Massachusetts. Philadelphia (PA): AACR; Cancer Res 2024;84(2 Suppl):Abstract nr B074.