The large language model ChatGPT can now accept image input with the GPT4-vision (GPT4V) version. We aimed to compare the performance of GPT4V to pretrained U-Net and vision transformer (ViT) models for the identification of the progression of multiple sclerosis (MS) on magnetic resonance imaging (MRI). Paired coregistered MR images with and without progression were provided as input to ChatGPT4V in a zero-shot experiment to identify radiologic progression. Its performance was compared to pretrained U-Net and ViT models. Accuracy was the primary evaluation metric and 95% confidence interval (CIs) were calculated by bootstrapping. We included 170 patients with MS (50 males, 120 females), aged 21-74 years (mean 42.3), imaged at a single institution from 2019 to 2021, each with 2-5 MRI studies (496 in total). One hundred seventy patients were included, 110 for training, 30 for tuning, and 30 for testing; 100 unseen paired images were randomly selected from the test set for evaluation. Both U-Net and ViT had 94% (95% CI: 89-98%) accuracy while GPT4V had 85% (77-91%). GPT4V gave cautious nonanswers in six cases. GPT4V had precision (specificity), recall (sensitivity), and F1 score of 89% (75-93%), 92% (82-98%), 91 (82-97%) compared to 100% (100-100%), 88 (78-96%), and 0.94 (88-98%) for U-Net and 94% (87-100%), 94 (88-100%), and 94 (89-98%) for ViT. The performance of GPT4V combined with its accessibility suggests has the potential to impact AI radiology research. However, misclassified cases and overly cautious non-answers confirm that it is not yet ready for clinical use. GPT4V can identify the radiologic progression of MS in a simplified experimental setting. However, GPT4V is not a medical device, and its widespread availability highlights the need for caution and education for lay users, especially those with limited access to expert healthcare. Without fine-tuning or the need for prior coding experience, GPT4V can perform a zero-shot radiologic change detection task with reasonable accuracy. However, in absolute terms, in a simplified "spot the difference" medical imaging task, GPT4V was inferior to state-of-the-art computer vision methods. GPT4V's performance metrics were more similar to the ViT than the U-net. This is an exploratory experimental study and GPT4V is not intended for use as a medical device.
Read full abstract