Background: The increasing workload of radiologists can lead to burnout and errors in radiology reports. Large language models, such as OpenAI's GPT-4, hold promise as error revision tools for radiology.

Purpose: To test the feasibility of GPT-4 use by determining its error detection, reasoning, and revision performance on head CT reports with varying error types and to validate its clinical utility by comparison with human readers.

Materials and Methods: A total of 10 300 head CT reports were retrospectively extracted from the Medical Information Mart for Intensive Care III public dataset. In experiment 1, among the 300 unaltered reports and 300 versions with applied errors, GPT-4 optimization was initially conducted with 200 reports. The remaining 400 were used for evaluation of error type detection, reasoning, and revision, as well as the analysis of reports with undetected errors. The performance was also compared with that of human readers. In experiment 2, the detection performance of GPT-4 was validated on 10 000 unaltered reports that were deemed error-free by physicians, and an analysis of false-positive results was conducted. A permutation test was conducted to assess differences in performance.

Results: GPT-4 demonstrated commendable performance in error detection (sensitivity, 84% for interpretive error and 89% for factual error), reasoning, and revision. Compared with GPT-4, human readers had worse factual error detection sensitivity (0.33-0.69 vs 0.89; P = .008 for radiologist 4, P < .001 for others) and took longer to review (82-121 seconds vs 16 seconds, P < .001). In 10 000 reports, GPT-4 detected 96 errors, with a low positive predictive value of 0.05, yet 14% of the false-positive responses were potentially beneficial.

Conclusion: GPT-4 effectively detects, reasons about, and revises errors in radiology reports. While it shows excellent performance in identifying factual errors, its ability to prioritize clinically significant findings is limited. Recognizing its strengths and limitations, GPT-4 could serve as a feasible tool.

© RSNA, 2025. Supplemental material is available for this article. See also the editorial by Choi in this issue.
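The abstract states that a permutation test was used to assess differences in performance but does not describe its design. As a rough illustration only, the sketch below shows a generic two-sided, unpaired permutation test on the difference in detection rates between two readers. The function name `permutation_test`, the unpaired label-shuffling design, the number of permutations, and the outcome vectors are all assumptions made for this example; none of them come from the study.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means of a and b.

    Hypothetical helper: an unpaired label-shuffling test, which may differ
    from the design actually used in the study.
    """
    rng = random.Random(seed)
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the outcomes to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated P value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up per-report outcomes (1 = error detected, 0 = missed); not study data.
reader_a = [1] * 18 + [0] * 2
reader_b = [1] * 9 + [0] * 11
p_value = permutation_test(reader_a, reader_b)
print(f"P = {p_value:.4f}")
```

If both readers scored the same reports, a paired variant that swaps outcomes within each report would arguably be the better match; the unpaired form is shown here only because it is the simplest to read.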