Background: Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports.

Purpose: To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for report content extraction, with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI.

Materials and Methods: In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs (ie, models whose weights are released under open licenses) was performed. These LLMs were compared with rule-based annotation and with OpenAI's GPT-4o, GPT-4o-mini, GPT-4-turbo, and GPT-3.5-turbo on a manually annotated public English chest radiography dataset (Indiana University; 3927 patients and reports). An annotated nonpublic German chest radiography dataset (18 500 reports from 16 844 patients [10 340 male; mean age, 62.6 years ± 21.5 [SD]]) was used to compare local fine-tuning of all open-weights LLMs (via low-rank adaptation and 4-bit quantization) with bidirectional encoder representations from transformers (BERT), using different subsets of reports (from 10 to 14 580). Nonoverlapping 95% CIs of macro-averaged F1 scores were defined as relevant differences.

Results: For the English reports, the highest zero-shot macro-averaged F1 score was observed for GPT-4o (92.4% [95% CI: 87.9, 95.9]); GPT-4o outperformed the rule-based CheXpert labeler [Stanford University] (73.1% [95% CI: 65.1, 79.7]) but was comparable in performance to several open-weights LLMs (top three: Mistral-Large [Mistral AI], 92.6% [95% CI: 88.2, 96.0]; Llama-3.1-70b [Meta AI], 92.2% [95% CI: 87.1, 95.8]; and Llama-3.1-405b [Meta AI], 90.3% [95% CI: 84.6, 94.5]). For the German reports, Mistral-Large (91.6% [95% CI: 90.5, 92.7]) achieved the highest zero-shot macro-averaged F1 score of the seven open-weights LLMs tested and outperformed the rule-based annotation (74.8% [95% CI: 73.3, 76.1]). With 1000 reports used for fine-tuning, all LLMs (top three: Mistral-Large, 94.3% [95% CI: 93.5, 95.2]; OpenBioLLM-70b [Saama], 93.9% [95% CI: 92.9, 94.8]; and Mixtral-8×22b [Mistral AI], 93.8% [95% CI: 92.8, 94.7]) achieved significantly higher macro-averaged F1 scores than did BERT (86.7% [95% CI: 85.0, 88.3]); however, the differences were not relevant when 2000 or more reports were used for fine-tuning.

Conclusion: LLMs have the potential to outperform rule-based systems for zero-shot "out-of-the-box" structuring of report databases, with privacy-preserving open-weights LLMs being competitive with the closed-weights GPT-4o. Additionally, open-weights LLMs outperformed BERT when moderate numbers of reports were used for fine-tuning.

Published under a CC BY 4.0 license. Supplemental material is available for this article. See also the editorial by Gee and Yao in this issue.
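The zero-shot setup described in Materials and Methods can be reproduced against a locally hosted open-weights model. Below is a minimal sketch, assuming an OpenAI-compatible local endpoint (eg, one served by vLLM or Ollama); the label set, prompt wording, and model name are illustrative stand-ins, not the study's actual configuration.

```python
# Minimal sketch: zero-shot extraction of structured labels from a free-text
# radiology report via a locally hosted open-weights LLM. Assumes an
# OpenAI-compatible endpoint (eg, served by vLLM or Ollama); the label set and
# prompt wording are illustrative, not the study's prompt.
import json
from openai import OpenAI

# A local endpoint keeps reports on-premises (privacy-preserving).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

LABELS = ["cardiomegaly", "pleural effusion", "pneumothorax"]  # illustrative subset

def extract_findings(report_text: str) -> dict:
    prompt = (
        "You are annotating chest radiography reports. For each finding in "
        f"{LABELS}, answer 1 (present) or 0 (absent). "
        "Reply with a JSON object only.\n\nReport:\n" + report_text
    )
    response = client.chat.completions.create(
        model="llama-3.1-70b",  # hypothetical model name on the local server
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # deterministic output for annotation
    )
    return json.loads(response.choices[0].message.content)

print(extract_findings("Heart size is enlarged. No effusion or pneumothorax."))
```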
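The local fine-tuning described in Materials and Methods combines low-rank adaptation (LoRA) with 4-bit quantization. A minimal sketch using the Hugging Face transformers, peft, and bitsandbytes libraries follows; the base model and all hyperparameters are illustrative assumptions, not the study's settings.

```python
# Minimal sketch: parameter-efficient local fine-tuning setup with low-rank
# adaptation (LoRA) on a 4-bit-quantized open-weights LLM. Model choice and
# hyperparameters are illustrative, not taken from the study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # stand-in open-weights model

# 4-bit NF4 quantization: the frozen base weights fit on a single local GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA: train small rank-r adapter matrices instead of the full weights.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```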
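The comparison rule in Materials and Methods treats nonoverlapping 95% CIs of macro-averaged F1 scores as relevant differences. The sketch below computes a macro-averaged F1 score with a bootstrap CI and applies that rule; bootstrapping is an assumption here, as the abstract does not state the study's exact CI procedure.

```python
# Minimal sketch: macro-averaged F1 with a bootstrap 95% CI, plus the
# "nonoverlapping CIs = relevant difference" decision rule from the abstract.
# The bootstrap is an assumed CI method; the abstract does not specify one.
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample reports
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    point = f1_score(y_true, y_pred, average="macro")
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, (lo, hi)

def relevantly_different(ci_a, ci_b):
    # A relevant difference is declared only when the 95% CIs do not overlap.
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]
```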