Validation of a deep learning model for traumatic brain injury detection and NIRIS grading on non-contrast CT: a multi-reader study with promising results and opportunities for improvement.

Bin Jiang, Sean Creeden, Casey H Halpern, Victoria Y Ding, Burak Berksu Ozkara, Jonathon J Parker, Bryan Lanzman, Alexander Khalaf, Sara Shams, Ying Li, Austin Trinh, Guangming Zhu, Hui Chen, Max Wintermark, Dylan Wolman

doi:10.1007/s00234-023-03170-5

https://doi.org/10.1007/s00234-023-03170-5

Abstract

This study aimed to assess and externally validate the performance of a deep learning (DL) model for the interpretation of non-contrast computed tomography (NCCT) scans of patients with suspected traumatic brain injury (TBI). This retrospective, multi-reader study included patients with suspected TBI who were transported to the emergency department and underwent NCCT. Eight reviewers with varying levels of training and experience (two neuroradiology attendings, two neuroradiology fellows, two neuroradiology residents, one neurosurgery attending, and one neurosurgery resident) independently evaluated the NCCT head scans. The same scans were evaluated using version 5.0 of the DL model icobrain tbi. The ground truth was established by consensus among the study reviewers after a thorough assessment of all accessible clinical and laboratory data and of follow-up imaging studies, including NCCT and magnetic resonance imaging. The outcomes of interest included neuroimaging radiological interpretation system (NIRIS) scores; the presence of midline shift, mass effect, hemorrhagic lesions, hydrocephalus, and severe hydrocephalus; and measurements of midline shift and of hemorrhagic lesion volumes. Comparisons were made using the weighted Cohen's kappa coefficient. The McNemar test was used to compare diagnostic performance, and Bland-Altman plots were used to compare measurements.

One hundred patients were included, and the DL model successfully categorized 77 scans. The median age was 48 years for the total group, 44.5 years for the omitted group, and 48 years for the included group. The DL model demonstrated moderate agreement with the ground truth, trainees, and attendings. With the DL model's assistance, trainees' agreement with the ground truth improved. The DL model showed high specificity (0.88) and positive predictive value (0.96) in classifying NIRIS scores as 0-2 or 3-4. Trainees and attendings had the highest accuracy (0.95). The DL model's performance in classifying various TBI CT imaging common data elements was comparable to that of trainees and attendings. The average difference for the DL model in quantifying the volume of hemorrhagic lesions was 6.0 mL, with a wide 95% confidence interval (CI) of -68.32 to 80.22, and the average difference for midline shift was 1.4 mm, with a 95% CI of -3.4 to 6.2.

While the DL model outperformed trainees in some aspects, attendings' assessments remained superior in most instances. Using the DL model as an assistive tool benefited trainees, improving the agreement of their NIRIS scores with the ground truth. Although the DL model showed high potential in classifying some TBI CT imaging common data elements, further refinement and optimization are necessary to enhance its clinical utility.
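For readers unfamiliar with the three statistical tools named in the abstract, the following is a minimal Python sketch of how each is commonly computed. It is not the authors' analysis code. The function names are hypothetical, and several choices are assumptions the abstract does not specify: linear (rather than quadratic) kappa weights, NIRIS scores coded as integers 0 to 4, an exact binomial form of the McNemar test, and an interval computed as the mean difference plus or minus 1.96 standard deviations (the usual Bland-Altman limits of agreement).

```python
import math
from statistics import mean, stdev


def weighted_kappa(rater_a, rater_b, n_categories=5, weights="linear"):
    """Weighted Cohen's kappa for two raters using ordinal codes 0..n_categories-1.

    n_categories=5 assumes NIRIS scores coded 0-4; the weighting scheme
    is an assumption, since the abstract does not state which was used.
    """
    n = len(rater_a)
    # Observed joint proportions of (rater_a score, rater_b score).
    observed = [[0.0] * n_categories for _ in range(n_categories)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_categories)]
    col = [sum(observed[i][j] for i in range(n_categories))
           for j in range(n_categories)]
    power = 1 if weights == "linear" else 2
    disagree_obs = disagree_exp = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            w = (abs(i - j) / (n_categories - 1)) ** power  # disagreement weight
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]
    if disagree_exp == 0.0:
        return float("nan")  # undefined when both raters use a single category
    return 1.0 - disagree_obs / disagree_exp


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bland_altman(measured, reference):
    """Return (mean difference, lower limit, upper limit) of agreement."""
    diffs = [m - r for m, r in zip(measured, reference)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of differences
    return bias, bias - spread, bias + spread
```

In this sketch, `weighted_kappa` would take two readers' NIRIS scores for the same scans, `mcnemar_exact` would take the counts of scans that one reader classified correctly and the other did not (in each direction), and `bland_altman` would take paired volume or midline shift measurements, for example the model's values and the reference values.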
