Background: Automated extraction of actionable details from recommendations for additional imaging (RAIs) in radiology reports could facilitate tracking and timely completion of clinically necessary RAIs, thereby potentially reducing diagnostic delays.

Purpose: To assess the performance of large language models (LLMs) in extracting actionable details of RAIs from radiology reports.

Methods: This retrospective single-center study evaluated reports of diagnostic radiology examinations performed across modalities and care settings within five subspecialties (abdominal imaging, musculoskeletal imaging, neuroradiology, nuclear medicine, thoracic imaging) in August 2023. Of reports identified by a previously validated natural language processing algorithm as containing an RAI, 250 were randomly selected; 231 of these were confirmed to contain an RAI on manual review and formed the study sample. Twenty-five reports were used to engineer a prompt instructing an LLM, when provided a report impression containing an RAI, to extract details about the modality, body part, timeframe, and rationale of the RAI; the remaining 206 reports were used to test the prompt in combination with GPT-3.5 and GPT-4. A fourth-year medical student and a radiologist from the relevant subspecialty independently classified the LLM outputs as correct or incorrect for extraction of the four actionable details of RAIs in comparison with the report impressions; a third reviewer assisted in resolving discrepancies. Extraction accuracy was summarized and compared between LLMs.

Results: For GPT-3.5 and GPT-4, respectively, the two reviewers agreed on classification of LLM output as correct versus incorrect with respect to report impressions in 95.6% and 94.2% of cases for RAI modality, 89.3% and 88.3% for RAI body part, 96.1% and 95.1% for RAI timeframe, and 89.8% and 88.8% for RAI rationale. Using consensus assessments, GPT-4 was more accurate than GPT-3.5 in extracting RAI modality (94.2% [194/206] vs 85.4% [176/206], p<.001), RAI body part (86.9% [179/206] vs 77.2% [159/206], p=.004), and RAI timeframe (99.0% [204/206] vs 95.6% [197/206], p=.02). Both LLMs had an accuracy of 91.7% (189/206) for extracting RAI rationale.

Conclusion: LLMs extracted actionable details of RAIs from free-text impression sections of radiology reports; GPT-4 outperformed GPT-3.5.

Clinical Impact: The technique could represent an innovative method to facilitate timely completion of clinically necessary radiologist recommendations.
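The extraction step described in the Methods, prompting an LLM with a report impression and requesting the four RAI details, can be sketched as below. The prompt wording, field names, and helper functions are illustrative assumptions for this sketch, not the study's actual prompt or pipeline; the model call itself is omitted and a simulated reply is parsed instead.

```python
import json

# Illustrative field names (assumed, not taken from the study's prompt).
RAI_FIELDS = ["modality", "body_part", "timeframe", "rationale"]

def build_rai_prompt(impression: str) -> str:
    """Assemble a hypothetical extraction prompt for a report impression
    containing a recommendation for additional imaging (RAI)."""
    return (
        "The following radiology report impression contains a recommendation "
        "for additional imaging (RAI). Extract the RAI's modality, body part, "
        "timeframe, and rationale. Reply with a JSON object whose keys are "
        f"{RAI_FIELDS}; use null for any detail not stated.\n\n"
        f"Impression: {impression}"
    )

def parse_rai_output(raw: str) -> dict:
    """Validate the model's JSON reply: require all four expected keys."""
    details = json.loads(raw)
    missing = [k for k in RAI_FIELDS if k not in details]
    if missing:
        raise ValueError(f"LLM output missing fields: {missing}")
    return details

# Simulated model reply (no API call is made in this sketch):
reply = (
    '{"modality": "CT", "body_part": "chest", "timeframe": "6 months", '
    '"rationale": "indeterminate pulmonary nodule"}'
)
print(parse_rai_output(reply)["modality"])  # CT
```

In the study, outputs like this were then compared against the report impressions by the two reviewers; structuring the reply as fixed-key JSON is one common way to make that comparison field by field.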