GPT-4 Analysis of MRI Reports in Suspected Myocarditis: A Multicenter Study

Kenan Kaya, Carsten Gietzen, Robert Hahnfeldt, Maher Zoubi, Tilman Emrich, Moritz C Halfmann, Malte Maria Sieren, Yannic Elser, Patrick Krumm, Jan M Brendel, Konstantin Nikolaou, Nina Haag, Jan Borggrefe, Ricarda Von Krüchten, Katharina Müller-Peltzer, Constantin Ehrengut, Timm Denecke, Andreas Hagendorff, Lukas Goertz, Roman J Gertz, Alexander Christian Bunck, David Maintz, Thorsten Persigehl, Simon Lennartz, Julian A Luetkens, Astha Jaiswal, Andra Iza Iuga, Lenhard Pennig, Jonathan Kottlors

doi:10.1016/j.jocmr.2024.101068

Abstract

Purpose: Diagnosing myocarditis relies on multimodal data including magnetic resonance imaging (MRI), clinical symptoms, and blood values. The correct interpretation and integration of MRI findings requires radiological expertise and knowledge. We aimed to investigate the performance of Generative Pre-trained Transformer 4 (GPT-4), a large language model, for report-based medical decision-making in the context of cardiac MRI for suspected myocarditis.

Methods: This retrospective study included MRI reports from 396 patients with suspected myocarditis across eight centers. MRI reports and patient data including blood values, age, and further clinical information were provided to GPT-4 and to radiologists with 1 (Resident 1), 2 (Resident 2), and 4 years (Resident 3) of experience in cardiovascular MRI and knowledge of the 2018 Lake Louise Criteria. The final impression of the report, which states the radiological assessment of whether myocarditis is present, was not provided. The performance of GPT-4 and of the human readers was compared to a consensus reading (two board-certified radiologists with 8 and 10 years of experience in cardiovascular MRI). Sensitivity, specificity, and accuracy were calculated.

Results: GPT-4 yielded an accuracy of 83%, sensitivity of 90%, and specificity of 78%, which was comparable to the physician with 1 year of experience (R1: 86%, 90%, 84%, p=.14) and lower than that of more experienced physicians (R2: 89%, 86%, 91%, p=.007 and R3: 91%, 85%, 96%, p<.001). GPT-4 and human readers showed a higher diagnostic performance when results from T1- and T2-mapping sequences were part of the reports; this difference was statistically significant for Resident 1 and Resident 3 (p=.004 and p=.02, respectively).

Conclusion: GPT-4 yielded good accuracy for diagnosing myocarditis based on MRI reports in a large dataset from multiple centers and therefore holds the potential to serve as a diagnostic decision-support tool in this capacity, particularly for less experienced physicians. Further studies are required to explore the full potential and elucidate educational aspects of the integration of large language models in medical decision-making.
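
For readers less familiar with the reported metrics, the following is a minimal sketch of how sensitivity, specificity, and accuracy can be computed for one reader against a consensus reference. It is an illustration under stated assumptions, not the study's analysis code: the function names (`confusion_counts`, `diagnostic_metrics`) are hypothetical, and the labels in the example are invented toy values, not study data.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(reference: Sequence[bool],
                     predicted: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn) for a reader against the reference standard.

    True means "myocarditis present". Both sequences must align case by case.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    tn = sum(1 for r, p in zip(reference, predicted) if not r and not p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    return tp, fp, tn, fn


def diagnostic_metrics(reference: Sequence[bool],
                       predicted: Sequence[bool]) -> Dict[str, float]:
    """Return sensitivity, specificity, and accuracy as fractions in [0, 1]."""
    tp, fp, tn, fn = confusion_counts(reference, predicted)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference case")
    return {
        "sensitivity": tp / (tp + fn),            # true-positive rate
        "specificity": tn / (tn + fp),            # true-negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy example (hypothetical labels, NOT study data):
# 10 reference-positive cases followed by 10 reference-negative cases.
toy_reference = [True] * 10 + [False] * 10
# The reader calls 9 of the positives and wrongly flags 3 of the negatives.
toy_predicted = [True] * 9 + [False] * 1 + [False] * 7 + [True] * 3

toy_metrics = diagnostic_metrics(toy_reference, toy_predicted)
```

In this toy case the reader has 9 true positives, 1 false negative, 7 true negatives, and 3 false positives. The same function could be applied once per reader (GPT-4 or each resident) against the consensus reading.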

Full Text