Abstract

This study assessed whether ChatGPT, a popular deep learning text generation tool, can serve as a resource for in-training and practicing clinicians by accurately identifying and summarizing studies related to radiation oncology. Three question templates (Q1-Q3, shown in Table 1) were applied to eight cancer types to compile 24 questions posed to ChatGPT. Cancer types were designated as either common (breast, non-small cell lung, prostate, p16 positive oropharyngeal, and rectal) or uncommon (hypopharyngeal, medulloblastoma, and vulvar). ChatGPT's responses to each question were then reviewed to quantify the number of studies referenced in the response, the percentage of listed studies that were real, and the percentage of listed studies that were correctly summarized. Outcomes were compared between cancer types (common vs uncommon) and question types using Wilcoxon rank sum tests. As a secondary analysis, we assessed the internal consistency of ChatGPT's responses by querying ChatGPT with three identical iterations of Q1-Q3 for breast cancer and comparing its responses across iterations. Across all 24 of ChatGPT's responses, 78 studies were referenced, of which 37 (47.4%) were real and 7 (9.0%) were correctly summarized. On average, each response included 3.25 studies (standard deviation [SD]: 0.74), of which 44.0% (SD: 44.2%) were real and 7.8% (SD: 14.6%) were correctly summarized. The proportion of correctly summarized studies did not differ significantly between common vs uncommon cancers (p = 0.29), between questions that specified randomized controlled trials (Q3) vs those that did not (Q1 or Q2) (p = 0.94), or between questions that specified intensity-modulated radiotherapy (Q2) vs those that did not (Q1 or Q3) (p = 0.31). Across the three iterations of ChatGPT queries for breast cancer, the number of studies listed for Q1, Q2, and Q3 ranged from 3 to 5, 2 to 3, and 3 to 5, respectively; the number of correctly summarized studies for each question ranged from 0 to 2, 0 to 1, and 0 to 1, respectively. ChatGPT's responses consistently included a large proportion of non-existent and incorrectly summarized studies. Furthermore, our secondary analysis suggests variability in the content and accuracy of ChatGPT's responses to identical questions, raising further concerns regarding reliability. Overall, our findings argue against the use of ChatGPT as a tool for reviewing literature related to radiation oncology.
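
For illustration, the group comparison described above (common vs uncommon cancers) can be sketched as a Wilcoxon rank-sum test on per-response accuracy proportions. The snippet below is a minimal sketch, not the authors' analysis code; it assumes one accuracy value per ChatGPT response (15 responses for common cancers, 9 for uncommon) and uses hypothetical placeholder values rather than the study data.

```python
# Minimal sketch of the Wilcoxon rank-sum comparison described in the abstract.
# NOTE: the proportions below are hypothetical placeholders, not the study's data.
from scipy.stats import ranksums

# One value per ChatGPT response: fraction of listed studies summarized correctly.
common_cancers = [0.0, 0.25, 0.0, 0.33, 0.0, 0.2, 0.0, 0.0,
                  0.25, 0.0, 0.0, 0.33, 0.0, 0.0, 0.2]        # 5 cancers x 3 questions (hypothetical)
uncommon_cancers = [0.0, 0.0, 0.33, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0]  # 3 cancers x 3 questions (hypothetical)

stat, p_value = ranksums(common_cancers, uncommon_cancers)
print(f"Wilcoxon rank-sum statistic = {stat:.3f}, p = {p_value:.3f}")
```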
