Towards Improved Radiological Diagnostics: Investigating the Utility and Limitations of GPT-3.5 Turbo and GPT-4 with Quiz Cases.

Tomohiro Kikuchi, Takahiro Nakao, Yuta Nakamura, Shouhei Hanaoka, Harushi Mori, Takeharu Yoshikawa

doi:10.3174/ajnr.a8332

Abstract

The rise of large language models such as generative pre-trained transformers (GPTs) has sparked significant interest in radiology, especially in interpreting radiological reports and image findings. While existing research has focused on GPTs estimating diagnoses from radiological descriptions, exploring alternative diagnostic information sources is also crucial. This study introduces the use of GPTs (GPT-3.5 Turbo and GPT-4) for information retrieval and summarization, searching for relevant case reports via PubMed, and investigates their potential to aid diagnosis. From October 2021 to December 2023, we selected 115 cases from the "Case of the Week" series on the American Journal of Neuroradiology website. Their Description and Legend sections were presented to the GPTs for two tasks. For the Direct Diagnosis task, the models provided three differential diagnoses, which were considered correct if they matched the diagnosis in the Diagnosis section. For the Case Report Search task, the models generated two keywords per case, which were used to create PubMed search queries that extracted up to three relevant reports. A response was considered correct if the extracted reports contained the disease name stated in the Diagnosis section. McNemar's test was employed to evaluate whether adding Case Report Search to Direct Diagnosis improved overall accuracy. In the Direct Diagnosis task, GPT-3.5 Turbo achieved a correct response rate of 26% (30/115 cases), whereas GPT-4 achieved 41% (47/115). For the Case Report Search task, GPT-3.5 Turbo scored 10% (11/115), and GPT-4 scored 7% (8/115). Combining the two tasks, correct responses totaled 32% (37/115) for GPT-3.5 Turbo, with three overlapping cases, and 43% (50/115) for GPT-4, with five overlapping cases. Adding Case Report Search improved GPT-3.5 Turbo's performance (p = 0.023) but not that of GPT-4 (p = 0.248).
The benefit of adding Case Report Search was particularly pronounced for GPT-3.5 Turbo, suggesting its potential as an alternative diagnostic approach, especially in scenarios where direct diagnoses from GPTs are not obtainable. Nevertheless, the overall performance of GPT models in both direct diagnosis and case report retrieval remains suboptimal, and users should be aware of their limitations.

ABBREVIATIONS: AI = artificial intelligence; GPT = generative pre-trained transformer; LLM = large language model.
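
The McNemar comparison described in the abstract can be illustrated with a short calculation. The sketch below is not the authors' code. It assumes the continuity-corrected chi-square form of the test (the abstract does not say which variant was used), and it assumes the discordant pairs are the cases gained by adding Case Report Search (37 − 30 = 7 for GPT-3.5 Turbo, 50 − 47 = 3 for GPT-4), with no cases lost, because an added search cannot turn a correct Direct Diagnosis into an incorrect one. The function name `mcnemar_corrected_p` is hypothetical.

```python
import math


def mcnemar_corrected_p(gained: int, lost: int) -> float:
    """Two-sided p-value of McNemar's test with continuity correction.

    gained: cases correct only after adding Case Report Search
    lost:   cases correct only before adding it
    """
    n = gained + lost
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    statistic = max(abs(gained - lost) - 1, 0) ** 2 / n
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


# Discordant counts inferred from the abstract's totals (an assumption,
# not data reported directly by the authors).
p_gpt35 = mcnemar_corrected_p(gained=37 - 30, lost=0)
p_gpt4 = mcnemar_corrected_p(gained=50 - 47, lost=0)
print(f"GPT-3.5 Turbo: p = {p_gpt35:.3f}")
print(f"GPT-4:         p = {p_gpt4:.3f}")
```

Under these assumptions the two values round to 0.023 and 0.248, in line with the p-values reported above. This is consistent with the continuity-corrected variant having been used, but it does not confirm it.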
