Abstract

Recent studies have shown that deep learning is capable of classifying dermatoscopic images at least as well as dermatologists. However, many studies in skin cancer classification use training images that are not biopsy-verified. This imperfect ground truth introduces a systematic error, but its effects on classifier performance are currently unknown. Here, we systematically examine the effects of label noise by training and evaluating convolutional neural networks (CNNs) with 804 images of melanoma and nevi labeled either by dermatologists or by biopsy. The CNNs are evaluated on a test set of 384 images by means of 4-fold cross-validation, comparing the outputs with either the corresponding dermatological or the biopsy-verified diagnosis. When training and test labels share the same ground truth, accuracies of 75.03% (95% CI: 74.39–75.66%) for dermatological and 73.80% (95% CI: 73.10–74.51%) for biopsy-verified labels can be achieved. However, if the CNN is trained and tested with different ground truths, accuracy drops significantly to 64.53% (95% CI: 63.12–65.94%, p < 0.01) on a non-biopsy-verified test set and to 64.24% (95% CI: 62.66–65.83%, p < 0.01) on a biopsy-verified test set. In conclusion, deep learning methods for skin cancer classification are highly sensitive to label noise, and future work should use biopsy-verified training images to mitigate this problem.
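The comparison described in the abstract amounts to scoring one set of classifier outputs against two different reference labelings. The following is a minimal sketch of that idea, not the authors' evaluation code: the function name `accuracy` and the eight toy labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the given reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one labeled example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data (1 = melanoma, 0 = nevus); NOT taken from the study.
biopsy_labels        = [1, 1, 1, 0, 0, 0, 1, 0]
dermatologist_labels = [1, 0, 1, 0, 0, 1, 1, 0]
cnn_predictions      = [1, 0, 1, 0, 0, 1, 0, 0]

# The same predictions receive different scores depending on the reference.
acc_vs_dermatologists = accuracy(cnn_predictions, dermatologist_labels)
acc_vs_biopsy = accuracy(cnn_predictions, biopsy_labels)

print(f"accuracy vs. dermatologist labels: {acc_vs_dermatologists:.3f}")
print(f"accuracy vs. biopsy labels:        {acc_vs_biopsy:.3f}")
```

In this toy case the predictions mostly follow the dermatologist labels, so they score higher against that reference than against the biopsy labels, which is the kind of gap the abstract reports when training and test ground truths differ.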

Highlights

  • Deep learning (DL) has revolutionized non-medical image analysis and is starting to change clinical workflows

  • To obtain the labels based on the majority decision of several dermatologists, we sent six electronic questionnaires each containing 134 images of nevi and melanoma to nine German university hospitals

  • The majority decision of dermatologists correctly classified 261 melanomas, whereas 141 melanomas were wrongly classified as nevi
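The labeling step and the melanoma counts in the highlights can be illustrated with a short sketch. The helper `majority_label` and its tie handling are assumptions, since the text does not say how ties between raters were resolved. Only the counts 261 and 141 come from the text.

```python
from collections import Counter
from typing import Sequence


def majority_label(votes: Sequence[str]) -> str:
    """Return the label chosen by most raters.

    Raising on ties is an assumption of this sketch; the source does not
    describe how ties were handled.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("tie between labels")
    return ranked[0][0]


# Hypothetical votes from three raters for a single image.
example = majority_label(["melanoma", "nevus", "melanoma"])

# Counts reported in the text: melanomas classified correctly and
# melanomas wrongly classified as nevus by the majority decision.
correct_melanomas = 261
missed_melanomas = 141
sensitivity = correct_melanomas / (correct_melanomas + missed_melanomas)

print(example)
print(f"majority-decision sensitivity for melanoma: {sensitivity:.4f}")
```

From the reported counts, the majority decision identified 261 of 402 biopsy-verified melanomas, a sensitivity of roughly 64.9%.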


Summary

Introduction

Deep learning (DL) has revolutionized non-medical image analysis and is starting to change clinical workflows. During the training phase, the parameters of a neural network are optimized on a large amount of input data and the corresponding class labels so that, for an unknown input, the predicted output ideally corresponds to the true class label. Both the input and the class labels are generally noisy, although so-called feature noise has less dramatic effects on classification quality than label noise (7). Many medical classification tasks show high inter- and intrarater variability, which increases label noise.
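To make the notion of label noise concrete, the sketch below flips a fraction of binary labels at random and measures how often the noisy labels still agree with the clean ones. It assumes symmetric, independent noise, which is a simplification and not the noise model of the study, where disagreement arises from rater diagnoses. The names `flip_labels` and `agreement` and the 20% noise rate are hypothetical.

```python
import random
from typing import List, Sequence


def flip_labels(labels: Sequence[int], noise_rate: float, seed: int = 0) -> List[int]:
    """Flip each binary label (0/1) independently with probability noise_rate."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two label sequences agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical balanced data set: 500 nevi (0) and 500 melanomas (1).
true_labels = [0, 1] * 500
noisy_labels = flip_labels(true_labels, noise_rate=0.2)

# Even a classifier that reproduces the true labels perfectly can only
# agree with the noisy labels about (1 - noise_rate) of the time.
ceiling = agreement(true_labels, noisy_labels)
print(f"agreement between clean and noisy labels: {ceiling:.3f}")
```

Under this simplified model, the measured accuracy of any classifier against noisy test labels is bounded by how well the noisy labels match the truth, which is one way to see why mismatched ground truths depress reported accuracy.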
