Abstract

The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item difficulties and error patterns. Traditionally, majority voting was used for (1) and coefficients of agreement for (2) and (3). Lately, model-based analysis of corpus annotations has proven better at all three tasks, but there has been relatively little work comparing such models on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.

Highlights

  • We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random annotators

  • The standard methodology for analyzing crowdsourced data in NLP is based on majority voting and inter-annotator coefficients of agreement, such as Cohen’s κ (Artstein and Poesio, 2008)

  • The results are compared with those obtained with a simple majority vote (MAJVOTE) baseline


Summary

Introduction

The standard methodology for analyzing crowdsourced data in NLP is based on majority voting (selecting the label chosen by the majority of coders) and inter-annotator coefficients of agreement, such as Cohen’s κ (Artstein and Poesio, 2008). Research suggests that models of annotation can address the shortcomings of these standard practices when applied to crowdsourcing (Dawid and Skene, 1979; Smyth et al., 1995; Raykar et al., 2010; Hovy et al., 2013; Passonneau and Carpenter, 2014). Such probabilistic approaches allow us to characterize the accuracy of the annotators and correct for their biases, as well as to account for item-level effects. The literature comparing models of annotation is limited, focusing either on synthetic data (Quoc Viet Hung et al., 2013) or on publicly available implementations that constrain the experiments almost exclusively to binary annotations (Sheshadri and Lease, 2013).
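To make the standard methodology above concrete, here is a minimal sketch (not code from the paper) of majority voting and a pairwise Cohen’s κ on a toy set of binary crowd labels; the data and the names `annotations`, `majority_vote`, and `cohens_kappa` are purely illustrative.

```python
# Toy illustration of the "standard methodology": majority voting for gold
# labels and pairwise Cohen's kappa for agreement. Not taken from the paper.
from collections import Counter

# item -> {annotator: binary label}; every annotator labels every item here
annotations = {
    "item1": {"a1": 1, "a2": 1, "a3": 0},
    "item2": {"a1": 0, "a2": 0, "a3": 0},
    "item3": {"a1": 1, "a2": 0, "a3": 1},
    "item4": {"a1": 0, "a2": 1, "a3": 1},
}

def majority_vote(item_labels):
    """Most frequent label for one item (ties broken arbitrarily)."""
    return Counter(item_labels.values()).most_common(1)[0][0]

gold_estimate = {item: majority_vote(labels) for item, labels in annotations.items()}

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # expected agreement if the two annotators labelled independently
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

a1 = [annotations[i]["a1"] for i in annotations]
a2 = [annotations[i]["a2"] for i in annotations]
print(gold_estimate)                     # majority-vote gold labels
print(round(cohens_kappa(a1, a2), 3))    # agreement between annotators a1 and a2
```

The model-based alternative cited above can be illustrated on the same toy data with a Dawid and Skene (1979) style procedure: expectation-maximization that jointly estimates a posterior over each item’s true label and a confusion matrix per annotator. This is only a schematic sketch under simplifying assumptions (complete annotations, maximum-likelihood EM rather than the richer formulations compared in the paper, which also cover item difficulty and parameter pooling); all names are illustrative.

```python
# Schematic Dawid & Skene (1979) style EM: infer item label posteriors and
# per-annotator confusion matrices from redundant crowd labels.
import numpy as np

# annotations[i, a] = label given by annotator a to item i (binary, complete)
annotations = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 1],
])
n_items, n_annot = annotations.shape
n_classes = 2

# initialise item label probabilities from per-item label frequencies
probs = np.zeros((n_items, n_classes))
for i in range(n_items):
    for a in range(n_annot):
        probs[i, annotations[i, a]] += 1
probs /= probs.sum(axis=1, keepdims=True)

for _ in range(50):  # EM iterations
    # M-step: class prevalence and confusion[annotator, true label, observed label]
    prevalence = probs.mean(axis=0)
    confusion = np.zeros((n_annot, n_classes, n_classes))
    for a in range(n_annot):
        for i in range(n_items):
            confusion[a, :, annotations[i, a]] += probs[i]
        confusion[a] /= confusion[a].sum(axis=1, keepdims=True)

    # E-step: posterior over each item's true label given current parameters
    for i in range(n_items):
        log_post = np.log(prevalence).copy()
        for a in range(n_annot):
            log_post += np.log(confusion[a, :, annotations[i, a]])
        post = np.exp(log_post - log_post.max())
        probs[i] = post / post.sum()

print(probs.argmax(axis=1))                   # inferred gold labels
print(confusion.diagonal(axis1=1, axis2=2))   # each annotator's per-class accuracy
```

The estimated confusion matrices directly characterize each annotator’s accuracy and bias, so random (spammy) annotators of the kind mentioned in the abstract show up as near-uniform rows, something majority voting and raw agreement coefficients do not expose directly.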


