The performance of large language model powered chatbots compared to oncology physicians on colorectal cancer queries.

Shan Zhou, Xiao Luo, Chan Chen, Hong Jiang, Chun Yang, Guanghui Ran, Juan Yu, Chengliang Yin

doi:10.1097/js9.0000000000001850

Abstract

Large language model (LLM)-powered chatbots have become increasingly prevalent in healthcare, but their capability in oncology remains largely unknown. This study aimed to evaluate the performance of LLM-powered chatbots compared with oncology physicians in answering colorectal cancer queries. The study was conducted between August 13, 2023, and January 5, 2024. A total of 150 questions were designed, and each question was submitted three times to each of eight chatbots: ChatGPT-3.5, ChatGPT-4, ChatGPT-4 Turbo, Doctor GPT, Llama-2-70B, Mixtral-8x7B, Bard, and Claude 2.1. No feedback was provided to the chatbots. The questions were also answered by nine oncology physicians: three residents, three fellows, and three attendings. Each answer was scored by its consistency with guidelines, with 1 for a consistent answer and 0 for an inconsistent one. The total score for each question was the number of correct answers, ranging from 0 to 3. The accuracy and scores of the chatbots were compared with those of the physicians. Claude 2.1 demonstrated the highest average accuracy at 82.67%, followed by Doctor GPT (80.45%), ChatGPT-4 Turbo (78.44%), ChatGPT-4 (78%), Mixtral-8x7B (73.33%), Bard (70%), ChatGPT-3.5 (64.89%), and Llama-2-70B (61.78%). In terms of accuracy, Claude 2.1 outperformed residents, fellows, and attendings; Doctor GPT outperformed residents and fellows; and Mixtral-8x7B outperformed residents. In terms of scores, Claude 2.1 outperformed residents and fellows, and Doctor GPT, ChatGPT-4 Turbo, and ChatGPT-4 outperformed residents. This study shows that LLM-powered chatbots can provide more accurate medical information than oncology physicians.
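The scoring scheme can be illustrated with a minimal sketch. It assumes that each question carries three 0/1 guideline-consistency grades (one per submission), that the per-question score is their sum (0 to 3), and that accuracy is the number of consistent answers divided by all answers given. The abstract does not spell out the exact aggregation, so this definition, the helper names `question_scores` and `accuracy`, and the toy data are all hypothetical.

```python
from typing import Sequence

# Hypothetical grading data: one inner sequence per question, holding the 0/1
# guideline-consistency grade of each of the three submissions
# (1 = consistent with guidelines, 0 = inconsistent).
Grades = Sequence[Sequence[int]]


def question_scores(grades: Grades) -> list[int]:
    """Per-question score: number of consistent answers out of three (0-3)."""
    scores = []
    for runs in grades:
        if len(runs) != 3 or any(g not in (0, 1) for g in runs):
            raise ValueError("each question needs exactly three 0/1 grades")
        scores.append(sum(runs))
    return scores


def accuracy(grades: Grades) -> float:
    """Assumed definition: consistent answers divided by all answers given."""
    total_answers = sum(len(runs) for runs in grades)
    return sum(question_scores(grades)) / total_answers


if __name__ == "__main__":
    # Toy example with four questions (not study data).
    toy = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]]
    print(question_scores(toy))     # [3, 2, 0, 2]
    print(round(accuracy(toy), 4))  # 0.5833
```

Under this assumed definition, 150 questions with three submissions each give 450 graded answers per chatbot, so an accuracy of 82.67% would correspond to 372 consistent answers.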

Published in: International Journal of Surgery (London, England)

Similar Papers

Performance of Large Language Models on a Neurology Board–Style Examination
Marc Cicero Schubert ... Varun Venkataramani
JAMA Network Open | VOL. 6 | 07 Dec 2023

Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
Ivan Civettini ... Carlo Gambacorti-Passerini
Blood | VOL. 142 | 02 Nov 2023

Performance of Large Language Models on Medical Oncology Examination Questions
Jack B Longwell ... Rahul G Krishnan
JAMA Network Open | VOL. 7 | 18 Jun 2024

Large language model may assist diagnosis of SAPHO syndrome by bone scintigraphy.
Yu Mori ... Ryuichi Kanabuchi
Modern Rheumatology | VOL. 34 | 28 Dec 2023