Abstract

Crowdsourcing makes it much faster and cheaper to obtain labels for the large amounts of data used in supervised learning. In the crowdsourcing scenario, an integrated label is inferred from the multiple noisy label set of each instance using a ground truth inference algorithm; this process is called label integration. However, a certain level of label noise remains in the integrated dataset, which degrades the performance of models trained on it. To the best of our knowledge, existing label noise correction algorithms use only the original attribute space and ignore the information contained in the multiple noisy label sets when building models. To address these problems, we propose a novel integrated label noise correction algorithm called co-training-based noise correction (CTNC). In CTNC, a weight is first calculated for each instance from the information provided by its multiple noisy label set. A label noise filter is then used to identify noisy instances, yielding a clean set and a noisy set. Next, a second attribute view of each instance in both the clean and noisy sets is generated by classifiers trained on the original attribute view of the clean set. Finally, a co-training framework is used to train two classifiers to relabel the integrated instances. Experimental results on 34 simulated datasets and 2 real-world datasets demonstrate that our proposed CTNC outperforms all state-of-the-art label noise correction algorithms used for comparison.
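To make the pipeline described above more concrete, the following is a minimal sketch of such a correction procedure in Python with scikit-learn. It is not the authors' exact method: the majority-vote weighting, the cross-validation-based noise filter, the use of class-probability outputs as the second view, and the single co-training round are all assumptions made for illustration, and names such as integrate_and_weight and make_second_view are invented here.

import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def integrate_and_weight(noisy_label_sets):
    # Majority-vote integration plus a per-instance weight equal to the
    # fraction of crowd labels that agree with the integrated label
    # (an assumed weighting scheme, not necessarily the one in CTNC).
    labels, weights = [], []
    for votes in noisy_label_sets:
        label, count = Counter(votes).most_common(1)[0]
        labels.append(label)
        weights.append(count / len(votes))
    return np.array(labels), np.array(weights)

def split_clean_noisy(X, y, weights, threshold=0.5):
    # A simple noise filter: flag an instance as noisy when a
    # cross-validated classifier disagrees with its integrated label
    # and its crowd-agreement weight is low.
    preds = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=5)
    noisy = (preds != y) & (weights < threshold)
    return ~noisy, noisy

def make_second_view(X_clean, y_clean, X_all):
    # Second attribute view: class-probability outputs of classifiers
    # trained on the original view of the clean set.
    view = []
    for clf in (RandomForestClassifier(random_state=0),
                LogisticRegression(max_iter=1000)):
        clf.fit(X_clean, y_clean)
        view.append(clf.predict_proba(X_all))
    return np.hstack(view)

def ctnc_relabel(X, noisy_label_sets):
    y, w = integrate_and_weight(noisy_label_sets)
    clean, noisy = split_clean_noisy(X, y, w)
    X2 = make_second_view(X[clean], y[clean], X)
    # Co-training: one classifier per view; each noisy instance is
    # relabeled by the more confident of the two (single round here).
    c1 = RandomForestClassifier(random_state=0).fit(X[clean], y[clean])
    c2 = LogisticRegression(max_iter=1000).fit(X2[clean], y[clean])
    y_corrected = y.copy()
    idx = np.where(noisy)[0]
    conf1, conf2 = c1.predict_proba(X[noisy]), c2.predict_proba(X2[noisy])
    for i, p1, p2 in zip(idx, conf1, conf2):
        y_corrected[i] = (c1.classes_[p1.argmax()] if p1.max() >= p2.max()
                          else c2.classes_[p2.argmax()])
    return y_corrected

In this sketch, ctnc_relabel takes the original attribute matrix X and a list of per-instance crowd label lists, and returns a corrected integrated label vector; an iterative co-training loop with confidence thresholds could replace the single relabeling round.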
