Abstract

What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered.

Highlights

  • Sentiment analysis is a form of shallow semantic analysis of texts

  • We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models

  • We summarize the published labeled sets used for training the classification models, and the machine learning methods applied for training


Introduction

Sentiment analysis is a form of shallow semantic analysis of texts. Its goal is to extract opinions, emotions, or attitudes towards different objects of interest [1, 2]. Since the first approaches in the 2000s, sentiment analysis has gained considerable attention with the massive growth of the web and social media. There are two prevailing approaches to large-scale sentiment analysis: (i) lexicon-based and (ii) machine learning. In the machine learning approach, a sentiment classification model is first constructed from a large set of sentiment-labeled texts and then applied to the stream of unlabelled texts. The model has the form of a function that maps features extracted from the text into sentiment labels (which typically have discrete values: negative, neutral, or positive). In both approaches, one needs considerable involvement of humans, at least initially. Humans have to label their perception of the sentiment expressed either in individual
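To make the two approaches concrete, the following is a minimal sketch of a lexicon-based classifier that maps a tweet to one of the three discrete labels. The toy lexicon and the tokenization are illustrative assumptions, not the lexicons or preprocessing used in the study; a machine learning model would replace the hand-made word lists with weights learned from the labeled training set.

```python
# Toy sentiment lexicon -- illustrative only, not a published resource.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def classify(tweet: str) -> str:
    """Map a tweet to a discrete label (negative/neutral/positive)
    by counting hits in the positive and negative word lists."""
    tokens = tweet.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great phone"))  # positive
print(classify("awful service and sad support"))  # negative
print(classify("the train arrives at noon"))  # neutral
```

The sketch also shows why lexicon-based methods are brittle: negation, sarcasm, and out-of-vocabulary words all default to "neutral", which is one reason trained models on large labeled sets tend to perform better.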
