Abstract

This study assessed whether ChatGPT, a popular deep learning text generation tool, accurately and comprehensively answers patient questions related to radiation oncology. A total of 28 common patient-centered questions were selected across radiation oncology content domains: diagnosis (4), workup (3), treatment (8), toxicity (4), and survivorship (9). To assess whether ChatGPT could detect inaccurate assumptions and/or respond negatively, two "negative control" questions were included in the treatment and toxicity domains. All questions were applied to common cancer types (breast, non-small cell lung, prostate, p16+ oropharyngeal, and rectal), uncommon cancer types (hypopharyngeal, medulloblastoma, and vulvar), and colon cancer as an additional "negative control." ChatGPT responses were graded 0 for any incorrect information, 1 for correct but missing essential content, and 2 for correct and appropriately comprehensive given the length of the response. Each response was graded by two blinded MD reviewers, with discordant grades resolved by a third MD reviewer. Score distributions were compared across content domains, question type ("negative control" vs. other), cancer type, and cancer commonality using the Chi-squared test. Overall, 252 questions were submitted to ChatGPT. Of the responses, 86 (34.1%) contained inaccurate information, 66 (26.2%) contained correct information but were missing essential content, and 100 (39.7%) were graded as correct and comprehensive. There was no significant difference in response score by content domain (p = 0.07), but response scores differed significantly across cancer types (p < 0.001). The top-scoring cancer types were breast (grade 0 = 10%, grade 1 = 21%, grade 2 = 68%) and prostate (grade 0 = 18%, grade 1 = 25%, grade 2 = 57%), while the two lowest-scoring cancer types were colon (grade 0 = 61%, grade 1 = 21%, grade 2 = 18%) and vulvar (grade 0 = 50%, grade 1 = 25%, grade 2 = 25%). Response scores also differed significantly among questions about common cancers, uncommon cancers, and the negative control cancer, with the model performing best on common cancer types (p = 0.003), and ChatGPT performed significantly worse on "negative control" questions (p < 0.001). ChatGPT failed to consistently generate accurate and comprehensive responses to the majority of radiation oncology patient-centered questions, particularly for less common cancers and for "negative control" questions containing incorrect assumptions. This raises concern that ChatGPT-mediated reinforcement of patient misperceptions regarding radiotherapy is possible.
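The abstract does not specify the statistical software used for the Chi-squared comparisons. A minimal sketch of how grade distributions could be compared across question groups with a Chi-squared test of independence, assuming SciPy and using illustrative counts rather than the study data, is shown below.

```python
# Hypothetical example: Chi-squared test comparing response-grade
# distributions (grades 0, 1, 2) across question groups, in the spirit
# of the comparisons described in the abstract.
# The counts below are illustrative placeholders, NOT the study data.
from scipy.stats import chi2_contingency

# Rows: question groups; columns: counts of grade 0, grade 1, grade 2.
observed = [
    [4, 7, 17],   # e.g., questions about common cancer types (illustrative)
    [13, 8, 7],   # e.g., questions about uncommon cancer types (illustrative)
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```

A p-value below the chosen significance threshold would indicate that the grade distribution depends on the question group, analogous to the cancer-type and commonality comparisons reported above.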
