We tested the ability of Chat Generative Pretrained Transformer (ChatGPT), an artificial intelligence chatbot, to answer questions relevant to scenarios covered in 3 clinical management guidelines published by the Society for Neuroscience in Anesthesiology and Critical Care (SNACC): endovascular treatment of stroke, perioperative stroke (Stroke), and care of patients undergoing complex spine surgery (Spine). Four neuroanesthesiologists independently assessed whether ChatGPT could apply 52 high-quality recommendations (HQRs) included in the 3 SNACC guidelines. HQRs were deemed present in the ChatGPT responses if noted by at least 3 of the 4 reviewers. Reviewers also identified incorrect references, potentially harmful recommendations, and whether ChatGPT cited the SNACC guidelines. The overall reviewer agreement for the presence of HQRs in the ChatGPT answers ranged from 0% to 100%. Only 4 of 52 (8%) HQRs were deemed present by at least 3 of the 4 reviewers after 5 generic questions, and 23 (44%) HQRs were deemed present after at least 1 additional targeted question. Potentially harmful recommendations were identified for each of the 3 clinical scenarios, and ChatGPT failed to cite the SNACC guidelines. The ChatGPT answers were open to human interpretation regarding whether the responses included the HQRs. Although targeted questions elicited more HQRs than generic questions, fewer than 50% of HQRs were noted even after targeted questioning. This suggests that ChatGPT should not currently be considered a reliable source of information for clinical decision-making. Future iterations of ChatGPT may include refined algorithms that improve its reliability as a source of clinical information.