Abstract

This study aimed to determine whether a publicly available advanced language model could help generate appropriate colorectal cancer (CRC) screening and surveillance recommendations. Poor physician knowledge or inability to accurately recall recommendations may affect adherence to CRC screening guidelines, and adoption of newer technologies can help improve the delivery of such preventive care services. An assessment consisting of 10 multiple-choice questions, comprising 5 CRC screening and 5 CRC surveillance clinical vignettes, was entered into Chat Generative Pretrained Transformer (ChatGPT) 3.5 in 4 separate sessions. Responses were recorded and checked for accuracy to determine the reliability of the tool. The mean number of correct answers was then compared against that of a control group of gastroenterologists and colorectal surgeons who answered the same questions with and without the help of a previously validated CRC screening mobile app. ChatGPT's average overall performance was 45%. Its mean number of correct answers was 2.75 (95% CI: 2.26-3.24) for screening questions, 1.75 (95% CI: 1.26-2.24) for surveillance questions, and 4.5 (95% CI: 3.93-5.07) overall. ChatGPT was also inconsistent, giving a different answer to 4 of the questions across sessions. A total of 238 physicians responded to the same assessment, 123 (51.7%) without and 115 (48.3%) with the mobile app. ChatGPT's mean number of total correct answers was significantly lower than that of physicians both without [5.62 (95% CI: 5.32-5.92)] and with the mobile app [7.71 (95% CI: 7.39-8.03); P < 0.001]. Large language models developed with artificial intelligence require further refinement before they can serve as reliable assistants in clinical practice.
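
For illustration, summary statistics of the kind reported above can be reproduced from per-session score tallies. The minimal Python sketch below computes a mean and a normal-approximation 95% confidence interval from hypothetical per-session totals; the paper reports only the aggregate figures, so both the individual scores and the interval method used here are illustrative assumptions.

```python
import statistics

# Hypothetical per-session totals (out of 10) for the 4 ChatGPT sessions.
# The paper reports only summary statistics, so these values are
# illustrative placeholders, not the study's actual data.
session_scores = [4, 4, 5, 5]

n = len(session_scores)
mean = statistics.mean(session_scores)
se = statistics.stdev(session_scores) / n ** 0.5  # standard error of the mean

# Normal-approximation 95% CI (z = 1.96); the paper does not state
# which interval method was used.
half_width = 1.96 * se
print(f"mean = {mean:.2f}, 95% CI: {mean - half_width:.2f}-{mean + half_width:.2f}")
```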
