Abstract

Objective: Recently, the use of large language models (LLMs) in medicine has become a prominent topic of discussion due to the rapid improvement of these tools in understanding and responding to natural language. Several models, both proprietary and open-source, are widely available to the public. We aim to evaluate the possible use of such LLMs in vascular surgery by assessing their ability to process common consult requests.

Methods: The senior author created 25 fictional vascular surgery consultation queries based on common consultation requests. Five attending surgeons and four LLMs (GPT 3.5, GPT 4, Bard, and Falcon 40B) were asked whether each consult was an emergency that needed attention within an hour. Responders were also asked whether the next best step was an examination, additional imaging, or an urgent operation. GPT 3.5 and GPT 4 also provided free-response answers on the next best step, which attending surgeons graded for scientific accuracy, possible harm, and content completeness.

Results: The rates of accurate emergency identification were 88%, 100%, 76%, and 88% for GPT 3.5, GPT 4, Falcon 40B, and Bard, respectively. Although GPT 3.5 and Bard had the same overall accuracy, GPT 3.5 had a high sensitivity at 100%, whereas Bard had a high specificity at 90%. GPT 4 had 100% sensitivity and specificity. LLMs agreed with the majority surgeon opinion on the next best step in 64% (GPT 3.5), 32% (GPT 4), 68% (Falcon 40B), and 36% (Bard) of cases. Collectively, 89.5% of the free-response answers from GPT 3.5 and GPT 4 adhered to the scientific consensus. Only 5% of responses were highly likely to cause clinically significant harm. Although only 4% included incorrect content, 17.5% of answers missed important content. There was no significant difference between GPT 3.5 and GPT 4 in free-response grades.

Conclusions: Existing, widely available LLMs exhibited a solid ability to identify vascular emergencies, with GPT 4 agreeing with attending surgeons in 100% of cases. However, these models continue to have identifiable deficiencies in treatment recommendations, a higher-level task. Future models might help triage incoming consults and provide preliminary management suggestions. The utility of such tools in clinical practice remains to be explored.
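
For readers less familiar with the triage metrics reported in the Results, the following is a minimal sketch of how accuracy, sensitivity, and specificity relate to yes/no emergency labels. It is not the authors' analysis code. The function name `triage_metrics` and the example labels are hypothetical and chosen only for illustration; they are not the study's data.

```python
def triage_metrics(truth, predicted):
    """Return (accuracy, sensitivity, specificity) for binary emergency labels.

    truth[i] is True when the reference answer says consult i is an emergency;
    predicted[i] is the responder's answer for the same consult.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # emergencies flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # non-emergencies cleared
    fp = sum(1 for t, p in pairs if not t and p)      # non-emergencies flagged
    fn = sum(1 for t, p in pairs if t and not p)      # emergencies missed
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # share of true emergencies identified
    specificity = tn / (tn + fp)  # share of non-emergencies identified
    return accuracy, sensitivity, specificity


# Hypothetical labels for eight consults (illustrative only, not study data).
truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, True, True, False, False, False, False]
accuracy, sensitivity, specificity = triage_metrics(truth, predicted)
```

In this made-up example the responder catches every emergency but over-triages one non-emergency, so sensitivity is perfect while specificity is not. This is the same kind of trade-off that lets two responders share an overall accuracy while differing in sensitivity and specificity.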
