Background

Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to the diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18), and assessed their congruence with the expert panel discussions presented in the guidelines.

Methods

We evaluated ChatGPT4.0, Gemini Advanced, and GPTo1-preview on ten clinical questions: eight derived from TG18 and two formulated by the authors. Two authors independently rated the accuracy of each LLM's responses on a four-point scale: (1) accurate and comprehensive, (2) accurate but not comprehensive, (3) partially accurate, partially inaccurate, and (4) entirely inaccurate. A third author resolved any scoring discrepancies. We then comparatively analyzed the performance of ChatGPT4.0 against the newer models, Gemini Advanced and GPTo1-preview, on the same set of questions to delineate their respective strengths and limitations.

Results

ChatGPT4.0 provided consistent responses for 90% of the questions. It delivered "accurate and comprehensive" answers for 4/10 (40%) questions and "accurate but not comprehensive" answers for 5/10 (50%); one response (10%) was rated "partially accurate, partially inaccurate." Gemini Advanced demonstrated higher accuracy on some questions but yielded a similar percentage of "partially accurate, partially inaccurate" responses. Notably, neither model produced "entirely inaccurate" answers.

Discussion

LLMs such as ChatGPT and Gemini Advanced show potential for accurately addressing clinical questions regarding acute cholecystitis. With awareness of their limitations, careful implementation, and ongoing refinement, LLMs could serve as valuable resources for physician education and patient information, potentially improving clinical decision-making in the future.