Abstract

We study the problem of selecting a subset of weakly labeled data in which each instance carries multiple redundant, imperfect labels. In real applications, labels from non-experts are cheap, so many are collected per instance and then aggregated to estimate the ground truth. However, preparing and processing the data can itself cost more than labeling, and noisy labels degrade the performance of supervised learning methods. We therefore introduce a new quality control mechanism that estimates the labeling quality of each instance and use it to select an optimal subset of the data. The mechanism indicates which instances already have enough reliable labels and how many more labels a given instance still needs. We first formulate the data subset selection problem under the probably approximately correct (PAC) model. We then show how to find an ϵ-optimal labeled instance based on expected labeling quality, and we propose new algorithms that select the k instances with the highest expected labeling quality. Using a reliable subset of the data provides substantial benefit over using all data with multiple imperfect labels, and the expected labeling quality is a good indicator of where to allocate labeling effort: it shows how many labels should be acquired for an instance and which instances qualify for selection relative to the others. Both theoretical guarantees and comprehensive experiments demonstrate the effectiveness and efficiency of our algorithms.
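
To make the idea of quality-based selection concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a hypothetical quality estimate (the fraction of an instance's labels that agree with its majority label) and uses the standard Hoeffding bound to suggest how many labels give an estimate within ϵ of its expectation with probability at least 1 − δ, treating each label's agreement as an independent bounded draw, which is a simplification. The names `agreement_quality`, `labels_needed`, and `select_top_k` are illustrative only.

```python
import math
from collections import Counter


def agreement_quality(labels):
    """Assumed quality proxy: fraction of labels matching the majority label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


def labels_needed(epsilon, delta):
    """Hoeffding bound: smallest n with 2 * exp(-2 * n * epsilon**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def select_top_k(label_sets, k):
    """Return the ids of the k instances with the highest estimated quality."""
    ranked = sorted(
        label_sets,
        key=lambda instance_id: agreement_quality(label_sets[instance_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy data (made up for illustration): five binary labels per instance.
label_sets = {
    "a": [1, 1, 1, 1, 0],
    "b": [1, 0, 1, 0, 1],
    "c": [0, 0, 0, 0, 0],
}
top = select_top_k(label_sets, k=2)
budget = labels_needed(epsilon=0.1, delta=0.05)
```

On this toy data the unanimous instance ranks first and the most contested one is dropped. The bound depends only on ϵ and δ, so it gives a worst-case label budget rather than an adaptive stopping rule.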
