Objective: The aim of this study was to assess the performance of ChatGPT-4 in evaluating chest radiographs and detecting abnormal findings, and to evaluate its utility in justifying computed tomography (CT) referrals.

Methods: This study included 59 patients (20 in the first phase and 39 in the second phase) from a publicly available chest X-ray dataset. The radiographs were evaluated by an experienced chest radiologist (as the gold standard), two radiology residents, and ChatGPT, first as normal or abnormal and then, if abnormal, as to whether CT was needed. Finally, the decisions of ChatGPT and the two radiology residents were compared with the expert radiologist's gold-standard decisions to calculate accuracy.

Results: For all 59 patients, the accuracy of Resident 1, Resident 2, and ChatGPT in normal-abnormal labeling was 76.27%, 93.22%, and 76.27%, respectively. Their accuracy for CT necessity was 67.80%, 72.88%, and 66.10%, respectively. The expert radiologist determined that CT was not necessary in 30 patients; of these 30 patients, Resident 1, Resident 2, and ChatGPT answered incorrectly in 14, 12, and 15 patients, respectively. There was no statistically significant difference among the responses of Resident 1, Resident 2, and ChatGPT regarding CT necessity (chi-square, p = 0.731).

Conclusion: The results of this study show that ChatGPT-4 is promising for chest X-ray interpretation and for the justification of CT scans. However, large language models such as ChatGPT still have major limitations and should be trained on a much larger number of radiology images.
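The abstract reports per-reader accuracies and a chi-square comparison for CT necessity. The minimal sketch below shows how such a comparison could be reproduced, assuming accuracy is computed as correct decisions divided by 59 and that the readers are compared with an unpaired chi-square test of independence over correct/incorrect counts back-calculated from the reported percentages; the study's actual statistical design (e.g., a paired test) may differ, so the resulting p-value is not expected to match the reported value exactly.

```python
# Illustrative sketch only: correct-answer counts are inferred from the reported
# accuracies (accuracy = correct / 59), and the unpaired chi-square test of
# independence is an assumed analysis, not necessarily the study's exact method.
from scipy.stats import chi2_contingency

TOTAL_PATIENTS = 59

# CT-necessity accuracies reported in the abstract.
reported_accuracy = {"Resident 1": 0.6780, "Resident 2": 0.7288, "ChatGPT": 0.6610}

# Back-calculate the number of correct decisions per reader.
correct = {name: round(acc * TOTAL_PATIENTS) for name, acc in reported_accuracy.items()}
# -> {'Resident 1': 40, 'Resident 2': 43, 'ChatGPT': 39}

# 3x2 contingency table: [correct, incorrect] per reader.
table = [[c, TOTAL_PATIENTS - c] for c in correct.values()]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
```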