Abstract
Faced with challenging cases, doctors are increasingly seeking diagnostic advice from large language models (LLMs). This study compares the ability of LLMs and human physicians to diagnose such cases. An offline dataset of 67 challenging cases with primary gastrointestinal symptoms was used to solicit possible diagnoses from seven LLMs and 22 gastroenterologists. The diagnoses by Claude 3.5 Sonnet covered the highest proportion of instructive diagnoses (76.1%; 95% confidence interval [CI], 70.6%–80.9%), significantly surpassing every gastroenterologist (p < 0.05 for all). Claude 3.5 Sonnet also achieved a significantly higher coverage rate than the gastroenterologists who used search engines or other traditional resources (76.1% [95% CI, 70.6%–80.9%] vs. 45.5% [95% CI, 40.7%–50.4%], p < 0.001). The study highlights that advanced LLMs may assist gastroenterologists in challenging cases by offering instructive, time-saving, and cost-effective diagnostic suggestions.