Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M Cheung, Robert Chen, Ronald M Summers, Justin F Rousseau, Peiyun Ni, Marc J Landsman, Sally L Baxter, Subhi J Al'Aref, Yijia Li, Alexander Chen, Josef A Brejt, Michael F Chiang, Yifan Peng, Zhiyong Lu

doi:10.1038/s41746-024-01185-7

Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V’s rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V’s high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.


Journal: npj Digital Medicine	Publication Date: Jul 23, 2024
Citations: 10	License type: CC BY 4.0


