The aim of this study was to investigate the concept of repeatability in a case-based performance evaluation of two classifiers commonly used in computer-aided diagnosis for the task of distinguishing benign from malignant lesions.

The authors performed .632+ bootstrap analyses on a data set of 1251 sonographic lesions, of which 212 were malignant. Several analyses investigated the impact of sample size and of the number of bootstrap iterations. The classifiers investigated were a Bayesian neural network (BNN) with five hidden units and linear discriminant analysis (LDA); both used the same four input lesion features. While the authors did evaluate classifier performance using receiver operating characteristic (ROC) analysis, the main focus was case-based performance, i.e., the classifier output for each individual test case measured over the bootstrap iterations. In this case-based analysis, the authors examined the variability of the classifier output and linked it to the concept of repeatability. Repeatability was assessed at the level of individual cases, overall across all cases in the data set, and as a function of the case-based classifier output. The impact of repeatability was studied when aiming to operate at a constant sensitivity or specificity and when aiming to operate at a constant threshold value for the classifier output.

The BNN slightly outperformed the LDA, with an area under the ROC curve of 0.88 versus 0.85 (p < 0.05). In the repeatability analysis on an individual-case basis, it was evident that different cases posed different degrees of difficulty to each classifier, as measured by the by-case output variability. When considering the entire data set, however, the overall repeatability of the BNN classifier was lower than that of the LDA classifier, i.e., the by-case variability of the BNN was higher. The dependence of the by-case variability on the average by-case classifier output differed markedly between the two classifiers. The BNN achieved its lowest variability (best repeatability) when operating at high sensitivity (>90%) and low specificity (<66%), whereas the LDA achieved this at moderate sensitivity (approximately 74%) and specificity (approximately 84%). When operating at a constant 90% sensitivity or a constant 90% specificity, the width of the 95% confidence interval for the corresponding classifier output was considerable for both classifiers and increased for smaller sample sizes. When operating at a constant threshold value for the classifier output, the width of the 95% confidence intervals for the corresponding sensitivity and specificity ranged from 9 percentage points (pp) to 30 pp.

The repeatability of the classifier output can have a substantial effect on the obtained sensitivity and specificity. Knowledge of classifier repeatability, in addition to overall performance level, is therefore important for the successful translation and implementation of computer-aided diagnosis in clinical decision making.
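The case-based repeatability analysis described above can be illustrated with a minimal sketch. The Python snippet below is not the authors' code: it uses scikit-learn's LinearDiscriminantAnalysis on synthetic four-feature data as a stand-in for the 1251-lesion sonography data set, omits the .632+ bias correction (which concerns the performance estimate, not the per-case variability), and uses a hypothetical fixed output threshold of 0.5. It records each case's out-of-bag classifier output across bootstrap iterations, summarizes the by-case output spread (the repeatability measure), and shows how that variability propagates to the sensitivity and specificity obtained at a fixed threshold.

```python
# Sketch (not the authors' implementation): by-case repeatability of an LDA
# classifier output, estimated from bootstrap out-of-bag predictions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for the study's data: 1251 cases, 212 malignant,
# four lesion features with some class separation.
n_total, n_malignant = 1251, 212
y = np.zeros(n_total, dtype=int)
y[:n_malignant] = 1
X = rng.normal(size=(n_total, 4)) + 0.8 * y[:, None]

n_boot = 1000
threshold = 0.5                       # hypothetical fixed operating threshold
outputs = [[] for _ in range(n_total)]  # out-of-bag outputs per case
sens, spec = [], []                   # operating point per iteration

for _ in range(n_boot):
    boot = rng.integers(0, n_total, size=n_total)   # sample with replacement
    oob = np.setdiff1d(np.arange(n_total), boot)    # out-of-bag test cases
    clf = LinearDiscriminantAnalysis().fit(X[boot], y[boot])
    p = clf.predict_proba(X[oob])[:, 1]             # output = P(malignant)
    for i, pi in zip(oob, p):
        outputs[i].append(pi)
    pred = p >= threshold
    sens.append(pred[y[oob] == 1].mean())           # fraction of malignant flagged
    spec.append((~pred)[y[oob] == 0].mean())        # fraction of benign cleared

# By-case repeatability: spread of each case's output over the iterations.
case_sd = np.array([np.std(o) for o in outputs])
print(f"median by-case output SD: {np.median(case_sd):.3f}")

# Variability of the operating point obtained at the fixed threshold.
for name, vals in (("sensitivity", sens), ("specificity", spec)):
    lo, hi = np.percentile(vals, [2.5, 97.5])
    print(f"95% CI width of {name} at threshold {threshold}: {hi - lo:.3f}")
```

Under this setup, the printed confidence-interval widths play the role of the 9 pp to 30 pp ranges reported in the abstract: even with the classifier and threshold held fixed, the sensitivity and specificity realized on resampled data vary from iteration to iteration.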