Abstract

Traditional supervised learning requires ground-truth labels for training, which are difficult to collect in many cases. Recently, crowdsourcing has established itself as an efficient labeling solution by resorting to non-expert crowds. To reduce the effect of labeling errors, one common practice is to distribute each instance to multiple workers, while each worker annotates only a subset of the data, resulting in the sparse annotation phenomenon. In this paper, we show that under class imbalance, i.e., even when the ground-truth labels are only slightly imbalanced, the sparse annotations are prone to be skewed in distribution and can bias the learning algorithm severely. To combat this issue, we propose a Distribution-Aware Self-training based Crowdsourcing learning (DASC) approach, which supplements the sparse annotations by adding confident pseudo-annotations while re-balancing the annotation distribution. Specifically, we propose a distribution-aware confidence measure to select the most confident pseudo-annotations, with minority/majority classes selected more/less frequently. As a universal framework, DASC is applicable to various crowdsourcing methods and yields consistent performance gains. We conduct extensive experiments on real-world crowdsourcing benchmarks, spanning slight to heavy imbalance ratios and various annotation sparsity levels, and show that DASC substantially improves previous crowdsourcing models by \(2\%\)-\(20\%\) absolute test accuracy while producing much more balanced annotations.
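
To make the selection rule concrete, the following is a minimal Python sketch of distribution-aware pseudo-annotation selection, assuming softmax confidences from the current model. The per-class threshold schedule (base_tau, the square-root frequency scaling, and the floor) is a hypothetical instantiation for illustration, not the paper's exact confidence measure; it only captures the stated idea that minority classes are selected more often and majority classes less often.

    import numpy as np

    def select_pseudo_annotations(probs, class_counts, base_tau=0.95, floor=0.5):
        """Select confident pseudo-annotations with class-dependent thresholds.

        probs:        (n, k) softmax outputs for unannotated instances.
        class_counts: (k,) current annotation counts per class.
        Returns the indices and pseudo-labels of the selected instances.
        """
        freq = class_counts / class_counts.sum()           # empirical class frequencies
        # Lower the confidence threshold for minority classes so they are
        # selected more frequently; keep it near base_tau for majority classes.
        tau = np.maximum(base_tau * np.sqrt(freq / freq.max()), floor)
        preds = probs.argmax(axis=1)                       # pseudo-labels
        conf = probs.max(axis=1)                           # model confidence
        mask = conf >= tau[preds]                          # class-dependent cutoff
        return np.flatnonzero(mask), preds[mask]

    # Example: 3 classes whose current annotations are heavily imbalanced.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet([1.0, 1.0, 1.0], size=8)
    idx, labels = select_pseudo_annotations(probs, np.array([900, 80, 20]))
    print(idx, labels)

In a self-training loop, the selected pseudo-annotations would be merged into the sparse annotation set before the next round of training, gradually re-balancing the per-class annotation distribution.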
