Abstract

Transfer learning is commonly used in deep learning for medical imaging to alleviate the problem of limited available data. In this work, we studied the risk of feature leakage and its dependence on sample size when a pretrained deep convolutional neural network (DCNN) is used as a feature extractor for the classification of breast masses in mammography. Feature leakage occurs when the training set is used for feature selection and classifier modeling while the cost function is guided by the validation performance or informed by the test performance. The high-dimensional feature space extracted from a pretrained DCNN suffers from the curse of dimensionality; feature subsets that provide excessively optimistic performance can be found for the validation set, or for the test set if the latter is allowed unlimited reuse during algorithm development. We designed a simulation study to examine feature leakage when using a DCNN as a feature extractor for mass classification in mammography. A total of 4577 unique mass lesions were partitioned by patient into three sets: 3222 for training, 508 for validation, and 847 for independent testing. Three pretrained DCNNs, AlexNet, GoogLeNet, and VGG16, were first compared on the training set in fourfold cross validation, and one was selected as the feature extractor. To assess generalization errors, the independent test set was sequestered as truly unseen cases. Training sets ranging in size from 10% to 75% of the available training set were simulated by random drawing, in addition to the full (100%) training set. Three commonly used feature classifiers, the linear discriminant, the support vector machine, and the random forest, were evaluated. A sequential feature selection method was used to find feature subsets that achieved high classification performance, in terms of the area under the receiver operating characteristic curve (AUC), on the validation set. The extent of feature leakage and the impact of training set size were analyzed by comparison with the performance on the unseen test set. All three classifiers showed large generalization errors between the validation set and the independent sequestered test set at all sample sizes. The generalization error decreased as the sample size increased. At 100% of the training set size, one classifier achieved an AUC as high as 0.91 on the validation set, while its performance on the unseen test set reached an AUC of only 0.72. Our results demonstrate that large generalization errors can occur in AI tools due to feature leakage. Without evaluation on unseen test cases, optimistically biased performance may be reported inadvertently, leading to unrealistic expectations and reduced confidence in clinical implementation.
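The mechanism described above can be illustrated with a minimal, self-contained sketch (not the study's actual pipeline or data). Here the high-dimensional DCNN feature space is stood in for by purely random features with uninformative labels, so any rise in validation AUC during sequential forward selection is attributable entirely to selection bias, and the sequestered test AUC stays near chance. The classifier, subset size, and sample sizes below are illustrative assumptions.

```python
# Sketch of feature leakage from validation-guided sequential feature selection.
# Assumption: features are random noise standing in for a high-dimensional
# DCNN feature space; labels carry no signal, so validation AUC gains are
# pure selection bias, not real discriminative power.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_train, n_val, n_test, n_features = 300, 100, 200, 1024

def make_split(n):
    X = rng.standard_normal((n, n_features))
    y = rng.integers(0, 2, size=n)   # labels independent of features
    return X, y

X_tr, y_tr = make_split(n_train)
X_va, y_va = make_split(n_val)
X_te, y_te = make_split(n_test)     # sequestered, never used for selection

# Greedy sequential forward selection guided by VALIDATION AUC.
# Reusing the validation set at every step tunes the subset to its noise.
selected, remaining = [], list(range(n_features))
best_val_auc = 0.5
for _ in range(10):                  # select up to 10 features
    best_f, best_auc = None, best_val_auc
    for f in remaining:
        cols = selected + [f]
        clf = LinearDiscriminantAnalysis().fit(X_tr[:, cols], y_tr)
        auc = roc_auc_score(y_va, clf.decision_function(X_va[:, cols]))
        if auc > best_auc:
            best_f, best_auc = f, auc
    if best_f is None:
        break
    selected.append(best_f)
    remaining.remove(best_f)
    best_val_auc = best_auc

final = LinearDiscriminantAnalysis().fit(X_tr[:, selected], y_tr)
val_auc = roc_auc_score(y_va, final.decision_function(X_va[:, selected]))
test_auc = roc_auc_score(y_te, final.decision_function(X_te[:, selected]))
print(f"selected {len(selected)} features")
print(f"validation AUC (leaked): {val_auc:.2f}")   # optimistically biased
print(f"sequestered test AUC:    {test_auc:.2f}")  # near 0.5 (chance)
```

Running this sketch shows the qualitative gap the study quantifies: the validation AUC climbs well above chance because the same validation cases steer every selection step, while the sequestered test AUC does not improve.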
