Everyone’s Voice Matters: Quantifying Annotation Disagreement Using Demographic Information

Ruyuan Wan,Jaehyung Kim,Dongyeop Kang

doi:10.1609/aaai.v37i12.26698

Abstract

In NLP annotation, it is common to have multiple annotators label the text and then obtain the ground truth labels based on major annotators’ agreement. However, annotators are individuals with different backgrounds and various voices. When annotation tasks become subjective, such as detecting politeness, offense, and social norms, annotators’ voices differ and vary. Their diverse voices may represent the true distribution of people’s opinions on subjective matters. Therefore, it is crucial to study the disagreement from annotation to understand which content is controversial from the annotators. In our research, we extract disagreement labels from five subjective datasets, then fine-tune language models to predict annotators’ disagreement. Our results show that knowing annotators’ demographic information (e.g., gender, ethnicity, education level), in addition to the task text, helps predict the disagreement. To investigate the effect of annotators’ demographics on their disagreement level, we simulate different combinations of their artificial demographics and explore the variance of the prediction to distinguish the disagreement from the inherent controversy from text content and the disagreement in the annotators’ perspective. Overall, we propose an innovative disagreement prediction mechanism for better design of the annotation process that will achieve more accurate and inclusive results for NLP systems. Our code and dataset are publicly available.

Full Text