Abstract

Chat Generative Pre-trained Transformer (ChatGPT) is a chatbot built on the GPT-3 language model. We sought to determine whether it can contribute to tumor board discussions by comparing the accuracy and clarity of its answers to challenging breast radiation oncology questions with those of human specialists. Twenty consecutive breast radiation oncology questions posted between January and February 2023 that received at least one human answer were curated from theMedNet, a physician-only Q&A platform for expert answers to real-world clinical situations. These questions were posed to ChatGPT, and its answers were paired with the first chronological human response to each question. Breast radiation oncologists at one academic institution were asked to rate, from 1 (strongly disagree) to 5 (strongly agree), the extent to which they agreed with each answer (accuracy score) and whether they felt the response provided clear and specific guidance relevant to the original question (clarity score). Wilson score intervals with continuity correction were used to estimate the proportion of questions on which ChatGPT received a higher median accuracy or clarity score than human responders. The Wilcoxon signed-rank test was used to compare median accuracy and clarity scores across all 20 questions. Six board-certified breast radiation oncologists evaluated the answers to the 20 questions, yielding 120 assessments each of the ChatGPT and human responses. The evaluators agreed or strongly agreed with ChatGPT responses in 49 (41%) assessments and with human responses in 66 (55%) assessments. ChatGPT achieved a higher median accuracy score than human responders on 7 questions (35%; 95% Wilson score CI, 16-59%), whereas humans outperformed ChatGPT on 8 questions (40%); there was no significant difference in median scores (Wilcoxon signed-rank p = 0.3). There was agreement or strong agreement that ChatGPT provided clear and specific guidance in 38 (32%) assessments, compared with 45 (38%) for human answers. No difference was detected in median clarity score across all questions (Wilcoxon signed-rank p = 0.8). On 3 questions (15%; 95% Wilson score CI, 4-39%), ChatGPT surpassed human responders on both median accuracy score and median clarity score; human responders likewise outperformed ChatGPT on both metrics on 3 (15%) questions. There was no detectable difference in the accuracy or clarity of answers provided by ChatGPT and human responders in this sample of 20 challenging breast radiation oncology questions. ChatGPT outperformed human responders in the accuracy and clarity of its answers to some questions, suggesting that it has the potential to contribute meaningfully to discussions of real-world clinical problems.
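For readers who want to check the interval estimates quoted above, the short Python sketch below applies the standard continuity-corrected Wilson score formula (Newcombe, 1998) to the reported counts of 7/20 and 3/20 questions. It is an illustration of the method named in the abstract, not the authors' analysis code; the function name is ours, and the Wilcoxon signed-rank comparisons are omitted because the per-question median scores are not reported here.

```python
# Minimal sketch: 95% Wilson score interval with continuity correction,
# the interval type named in the abstract. Illustrative only; not the
# authors' analysis code.
from math import sqrt

def wilson_cc_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Continuity-corrected Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 2 * (n + z**2)
    lower_num = 2*n*p + z**2 - 1 - z * sqrt(z**2 - 2 - 1/n + 4*p*(n*(1 - p) + 1))
    upper_num = 2*n*p + z**2 + 1 + z * sqrt(z**2 + 2 - 1/n + 4*p*(n*(1 - p) - 1))
    return max(0.0, lower_num / denom), min(1.0, upper_num / denom)

# 7 of 20 questions with a higher median accuracy score -> approx. (0.16, 0.59)
print(wilson_cc_interval(7, 20))
# 3 of 20 questions better on both accuracy and clarity -> approx. (0.04, 0.39)
print(wilson_cc_interval(3, 20))
```

Both intervals reproduce the 16-59% and 4-39% ranges reported in the abstract; a paired comparison of per-question median scores could be run with scipy.stats.wilcoxon if the underlying ratings were available.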
