Case-based repeatability of machine learning classification performance on breast MRI

Michael Vieceli, Karen Drukker, Heather M Whitney, Maryellen L Giger, Hiroyuki Abe, Amy Van Dusen, Horst K Hahn, Maciej A Mazurowski

doi:10.1117/12.2548144

Abstract

Computer-aided diagnosis and radiomics have shown potential in the diagnosis and prognosis of breast cancer. The purpose of this study was to investigate the repeatability of classifier output and its relationship to classification performance for breast lesions imaged with dynamic contrast-enhanced MRI. Images of 1,169 breast lesions (267 benign, 902 cancers) were retrospectively collected under HIPAA/IRB compliance. The lesions were segmented automatically using a fuzzy c-means method, and thirty-eight radiomic features were extracted. Three classification tasks were investigated, with different proportions of cases in each class: (i) benign (23%) vs. malignant (77%), (ii) “pure” ductal carcinoma in situ (DCIS) (25%) vs. DCIS with invasive ductal carcinoma (IDC) (75%), and (iii) invasive cancers of molecular subtype luminal A or luminal B (66%) vs. other molecular subtypes (34%). For each task, support vector machine classifiers were trained and tested within 0.632+ bootstrap analyses (1,000 iterations), and the 0.632+ bias-corrected area under the ROC curve (AUC) served as the classification performance metric. Repeatability of classifier output was evaluated at three levels: (a) repeatability by case (performance metric: width of the 95% confidence interval of the classifier-estimated posterior probabilities for each case), (b) repeatability within the dataset (performance metric: median and 95% confidence interval of the by-case 95% confidence interval widths), and (c) the potential relationship between classification performance and repeatability. In the classification performance assessment, the median AUCs [95% confidence interval] for the three tasks were 0.85 [0.83, 0.87], 0.84 [0.80, 0.87], and 0.65 [0.60, 0.69], respectively. In the assessment of repeatability within the dataset, the median confidence interval widths [95% confidence interval] for the posterior probabilities were 0.25 [0.08, 0.72], 0.34 [0.14, 0.84], and 0.23 [0.14, 0.68], respectively. In conclusion, the classifiers for the first two tasks demonstrated strong classification performance, while the classifiers for all three tasks showed similar repeatability of posterior probabilities.
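
The two repeatability metrics described in the abstract can be illustrated with a minimal Python sketch. It assumes that each case has a list of posterior probabilities collected from the bootstrap iterations in which it was tested, and that the 95% confidence interval is taken as the 2.5th to 97.5th percentile range; the abstract does not state how the intervals were computed, so the percentile method, the function names, and the example data are all illustrative assumptions rather than the authors' implementation.

```python
def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between order statistics."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)  # floor, since pos is non-negative
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def ci_width(values, level=95.0):
    """Width of the central percentile interval at the given level (assumed CI method)."""
    tail = (100.0 - level) / 2.0
    return percentile(values, 100.0 - tail) - percentile(values, tail)


def by_case_ci_widths(posteriors_by_case):
    """Repeatability by case: CI width of each case's posterior probabilities."""
    return {case: ci_width(p) for case, p in posteriors_by_case.items()}


def summarize_widths(widths_by_case):
    """Repeatability within the dataset: median and 95% interval of the by-case widths."""
    widths = list(widths_by_case.values())
    return (percentile(widths, 50.0),
            percentile(widths, 2.5),
            percentile(widths, 97.5))


# Hypothetical classifier outputs for three cases across bootstrap test sets.
example = {
    "case_1": [0.80, 0.82, 0.79, 0.85, 0.81],
    "case_2": [0.30, 0.55, 0.42, 0.61, 0.38],
    "case_3": [0.10, 0.12, 0.11, 0.09, 0.10],
}
widths = by_case_ci_widths(example)
median_width, lower, upper = summarize_widths(widths)
```

In this sketch, a narrow by-case width (as for the hypothetical `case_3`) indicates that the classifier assigns nearly the same posterior probability to that case regardless of which bootstrap sample was used for training, while a wide one (as for `case_2`) indicates low repeatability.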

Full Text