Deep learning is widely used in medical imaging, and Convolutional Neural Networks (CNNs) are the most common choice of model. Dataset quality is crucial when training CNN diagnostic models, because mislabeled data can easily degrade their accuracy. However, identifying mislabeled medical images requires specialist knowledge, so it is difficult for non-specialists to judge which labels are wrong. In this paper, we propose a new framework named medical image dataset cleaning (MIDC), whose main contribution is to improve the quality of public datasets by automatically removing mislabeled data. MIDC has three main innovations. First, the framework uses multiple public datasets of the same disease and relies on different CNNs to automatically recognize images and remove mislabeled data. This process does not rely on annotations from professional physicians and does not require additional datasets with more reliable labels. Second, a novel grading rule divides the datasets into high-accuracy and low-accuracy datasets, which forms the basis for the data cleaning process. Third, a novel CNN-based data cleaning module uses the high-accuracy datasets to identify and clean mislabeled data in the low-accuracy datasets. In the experiments, the framework was validated on datasets for four diseases: diabetic retinal disease, viral pneumonia, breast tumor, and skin cancer. The average diagnostic accuracy increased from 71.18% to 85.13%, from 82.50% to 93.79%, from 85.59% to 93.45%, and from 84.55% to 94.21%, respectively. MIDC can therefore help physicians diagnose diseases more reliably when the available datasets contain mislabeled data.
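The grading and cleaning steps described above can be illustrated with a minimal sketch. It is not the MIDC implementation: the abstract does not specify the grading rule or the cleaning criterion, so the accuracy threshold, the unanimous-vote rule, and the function names `grade_datasets` and `clean_dataset` are all assumptions made for illustration. Classifiers are modeled as plain callables standing in for CNNs trained on the high-accuracy datasets.

```python
from collections import Counter


def grade_datasets(accuracy_by_dataset, threshold=0.8):
    """Split dataset names into high- and low-accuracy groups.

    ASSUMPTION: a fixed accuracy threshold stands in for the paper's
    grading rule, which the abstract does not describe.
    """
    high = [name for name, acc in accuracy_by_dataset.items() if acc >= threshold]
    low = [name for name, acc in accuracy_by_dataset.items() if acc < threshold]
    return high, low


def clean_dataset(samples, labels, classifiers, min_agreement=1.0):
    """Flag samples whose label contradicts a confident classifier consensus.

    ASSUMPTION: a sample is treated as mislabeled when at least
    `min_agreement` of the classifiers agree on one class and that class
    differs from the given label. Returns (kept_indices, removed_indices).
    """
    kept, removed = [], []
    for idx, (sample, label) in enumerate(zip(samples, labels)):
        votes = Counter(clf(sample) for clf in classifiers)
        top_label, top_count = votes.most_common(1)[0]
        confident = top_count / len(classifiers) >= min_agreement
        if confident and top_label != label:
            removed.append(idx)  # consensus disagrees with the given label
        else:
            kept.append(idx)
    return kept, removed


# Toy stand-ins for CNNs: each maps a scalar "image" to a class label.
toy_classifiers = [
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.6),
]
toy_samples = [0.1, 0.9, 0.8, 0.2]
toy_labels = [0, 1, 0, 0]  # the third label is deliberately wrong

high, low = grade_datasets({"dataset_A": 0.9, "dataset_B": 0.7})
kept, removed = clean_dataset(toy_samples, toy_labels, toy_classifiers)
print(high, low, kept, removed)
```

Requiring unanimous agreement (`min_agreement=1.0`) is a conservative choice: it removes a sample only when every classifier contradicts its label, which reduces the chance of discarding correctly labeled but hard examples.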