Abstract

Objective: Artificial intelligence (AI) continues to become increasingly integrated with clinical medicine. Generative AI, and particularly large language models (LLMs) such as ChatGPT-3.5 and ChatGPT-4, has shown promise in generating human-like text, offering a potential tool for augmenting clinical care. These online AI chatbots have already demonstrated remarkable clinical potential, having, for example, passed the US Medical Licensing Examination. Evaluation of these LLMs in the surgical literature, especially as it applies to judgment and decision-making, remains sparse. This study aimed to (1) evaluate the efficacy of ChatGPT-4 in providing clinician-level vascular surgery recommendations and (2) compare its performance with that of its predecessor, ChatGPT-3.5, to gauge the progression of the clinical competencies of LLMs.

Methods: A set of 40 clinician-level questions spanning four domains of vascular surgery (carotid artery disease, visceral artery aneurysms, abdominal aortic aneurysms, and chronic limb-threatening ischemia) was generated by clinical experts. These domains were chosen based on the availability of updated guidelines published before September 2021, the cutoff date for the training dataset of the LLMs. The questions, devoid of additional context or prompts, were input into ChatGPT-3.5 and ChatGPT-4 between March 20 and March 25, 2023. Responses were independently evaluated by two blinded reviewers using a 5-point Likert scale assessing comprehensiveness, accuracy, and consistency with guidelines. The Flesch-Kincaid grade level of each response was also determined. The independent samples t test and Fisher's exact test were used for comparative analysis.

Results: ChatGPT-4 significantly outperformed ChatGPT-3.5, providing appropriate recommendations for 38 of 40 questions (95%) compared with 13 of 40 (32.5%) for ChatGPT-3.5 (Fisher's exact test, P < .001). Despite longer responses (ChatGPT-4 mean, 317 ± 58 words vs ChatGPT-3.5 mean, 265 ± 74 words; P < .001), the reading ease of the two models remained similar, corresponding to college graduate-level text.

Conclusions: ChatGPT-4 can consistently respond accurately to complex clinician-level vascular surgery questions. This also represents a substantial advancement over its predecessor, which was released only a few months earlier, highlighting the rapid progress of LLMs in clinical medicine. Several limitations persist with the use of LLMs, including hallucinations, data privacy issues, and the black box problem. However, these findings suggest that, with further refinement, LLMs like ChatGPT-4 have the potential to become indispensable tools in clinical decision-making, thereby marking an exciting frontier in the fusion of AI with clinical medicine and vascular surgery.
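The abstract does not include the authors' analysis code. As an illustrative sketch only, the headline comparison (38/40 appropriate responses for ChatGPT-4 vs 13/40 for ChatGPT-3.5) can be checked with a two-sided Fisher's exact test on the 2×2 contingency table. The function below is a from-scratch stdlib implementation (its name and structure are assumptions of this sketch, not the authors' method):

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test on a 2x2 table: sums the
    hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    (a, b), (c, d) = table
    r1, r2 = a + b, c + d          # row totals
    c1 = a + c                     # first-column total
    n = r1 + r2                    # grand total

    def prob(x):                   # hypergeometric P(cell_11 = x)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # small tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# 38/40 appropriate (ChatGPT-4) vs 13/40 (ChatGPT-3.5)
p = fisher_exact_two_sided([[38, 2], [13, 27]])
print(p)  # far below the .001 threshold reported in the abstract
```

This reproduces the direction and magnitude of the reported significance (P < .001); the published analysis may of course have used a standard statistics package rather than a hand-rolled routine.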
