Abstract

Background and Objectives: This observational study assessed the performance of an artificial intelligence-powered chatbot tasked with solving unknown neurologic case vignettes. The primary objective was to assess the current capabilities of widely accessible artificial intelligence within clinical neurology, to determine how this technology can be deployed in clinical practice, and to identify which insights from its performance can be translated to clinical education.

Methods: This observational study tested the accuracy of GPT-4, an artificial intelligence-powered chatbot, at localizing and generating a differential diagnosis for a series of 29 clinical case vignettes. The cases were drawn from previously published educational material prepared for learners. No case required more than text input, a current limitation of GPT-4. The primary outcome measures were ranked accuracy of localization and differential diagnosis based on clinical history and examination alone and after ancillary clinical data were provided. Secondary outcome measures included a comparison of accuracy by case difficulty.

Results: GPT-4 identified the correct localization less than 50% of the time and performed worse when provided with ancillary testing. GPT-4 was more accurate at localization and diagnosis for easier cases than for harder cases. Diagnostic accuracy was independent of its ability to localize the lesion.

Discussion: GPT-4 did not perform as well on neurology clinical vignettes as its reported accuracy on clinical vignettes from other areas of medicine. Incorporating an AI chatbot into the practice of clinical neurology will require neurology-focused teaching.
