Deep Learning to Differentiate Benign and Malignant Vertebral Fractures at Multidetector CT.

Sarah C Foreman, David Schinz, Malek El Husseini, Sophia S Goller, Jürgen Weißinger, Anna-Sophia Dietrich, Martin Renz, Marie-Christin Metz, Georg C Feuerriegel, Benedikt Wiestler, Robert Stahl, Benedikt J Schwaiger, Marcus R Makowski, Jan S Kirschke, Alexandra S Gersing

doi:10.1148/radiol.231429

Abstract

Background: Differentiating between benign and malignant vertebral fractures poses diagnostic challenges.

Purpose: To investigate the reliability of CT-based deep learning models to differentiate between benign and malignant vertebral fractures.

Materials and Methods: CT scans acquired in patients with benign or malignant vertebral fractures from June 2005 to December 2022 at two university hospitals were retrospectively identified based on a composite reference standard that included histopathologic and radiologic information. An internal test set was randomly selected, and an external test set was obtained from an additional hospital. Models used a three-dimensional U-Net encoder-classifier architecture and applied data augmentation during training. Performance was evaluated using the area under the receiver operating characteristic curve (AUC) and compared with that of two residents and one fellowship-trained radiologist using the DeLong test.

Results: The training set included 381 patients (mean age, 69.9 years ± 11.4 [SD]; 193 male) with 1307 vertebrae (378 benign fractures, 447 malignant fractures, 482 malignant lesions). Internal and external test sets included 86 (mean age, 66.9 years ± 12; 45 male) and 65 (mean age, 68.8 years ± 12.5; 39 female) patients, respectively. The better-performing model of two training approaches achieved AUCs of 0.85 (95% CI: 0.77, 0.92) in the internal and 0.75 (95% CI: 0.64, 0.85) in the external test sets. Including an uncertainty category further improved performance to AUCs of 0.91 (95% CI: 0.83, 0.97) in the internal test set and 0.76 (95% CI: 0.64, 0.88) in the external test set. The AUC values of residents were lower than that of the best-performing model in the internal test set (AUC, 0.69 [95% CI: 0.59, 0.78] and 0.71 [95% CI: 0.61, 0.80]) and external test set (AUC, 0.70 [95% CI: 0.58, 0.80] and 0.71 [95% CI: 0.60, 0.82]), with significant differences only for the internal test set (P < .001). The AUCs of the fellowship-trained radiologist were similar to those of the best-performing model (internal test set, 0.86 [95% CI: 0.78, 0.93; P = .39]; external test set, 0.71 [95% CI: 0.60, 0.82; P = .46]).

Conclusion: Developed models showed a high discriminatory power to differentiate between benign and malignant vertebral fractures, surpassing or matching the performance of radiology residents and matching that of a fellowship-trained radiologist.

© RSNA, 2024

See also the editorial by Booz and D'Angelo in this issue.
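
The abstract reports performance as AUC with 95% confidence intervals. As a rough illustration of what such a figure represents, the sketch below computes AUC as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case, and estimates an interval with a percentile bootstrap. The abstract does not state how the intervals were derived, so the bootstrap is an assumption; the DeLong test used for the reader comparison is not implemented here, and the labels and scores are made-up toy values, not study data.

```python
import random


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (an assumed method, see lead-in)."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to resample")
    rng = random.Random(seed)
    idx = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data: 1 = malignant, 0 = benign; scores are hypothetical model outputs.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9]

print(auc(labels, scores))  # 0.875 (14 of 16 positive/negative pairs ranked correctly)
print(bootstrap_ci(labels, scores))
```

With only eight toy cases the bootstrap interval is very wide; the intervals quoted in the abstract come from test sets of 86 and 65 patients.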
