Cross-lingual sentiment classification in low-resource Bengali language

Salim Sazzed

doi:10.18653/v1/2020.wnut-1.8

Abstract

Sentiment analysis research in low-resource languages such as Bengali remains largely unexplored due to the scarcity of annotated data and the lack of text-processing tools. Therefore, in this work, we focus on generating resources and showing the applicability of the cross-lingual sentiment analysis approach in Bengali. For benchmarking, we created and annotated a comprehensive corpus of around 12,000 Bengali reviews. To address the lack of standard text-processing tools in Bengali, we leverage resources from English through machine translation. We measure the performance of supervised machine learning (ML) classifiers on the machine-translated English corpus and compare it with their performance on the original Bengali corpus. In addition, we examine sentiment preservation in the machine-translated corpus using Cohen’s kappa and Gwet’s AC1. To circumvent the laborious data-labeling process, we explore lexicon-based methods and study the applicability of cross-domain labeled data from the resource-rich language. We find that supervised ML classifiers perform comparably on the Bengali and machine-translated English corpora. By utilizing labeled data, they achieve 15%-20% higher F1 scores than both lexicon-based and transfer learning-based methods. We also observe that machine translation does not alter the sentiment polarity of a review in most cases. Our experimental results demonstrate that the machine translation-based cross-lingual approach can be an effective way to perform sentiment classification in Bengali.

Highlights

Sentiment analysis classifies the semantic orientation of a text
We assess the agreement of the predictions of various supervised machine learning (ML) classifiers on the Bengali and machine-translated English corpora using Cohen’s kappa and Gwet’s AC1 statistics
We provide the comparative performance of ML classifiers on the Bengali and machine-translated English corpora and the agreement of their predictions
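
The two agreement statistics named above can be illustrated with a minimal sketch. It assumes two equal-length lists of categorical predictions, for example one classifier's outputs on the Bengali reviews and on their English translations. The function name `agreement_stats` and the sample labels are hypothetical and are not taken from the paper. The formulas are the standard two-rater definitions, which differ only in how chance agreement is estimated.

```python
from collections import Counter


def agreement_stats(labels_a, labels_b):
    """Return (Cohen's kappa, Gwet's AC1) for two lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    categories = sorted(set(labels_a) | set(labels_b))
    q = len(categories)
    if q < 2:
        raise ValueError("at least two distinct categories are required")

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)

    # Cohen's chance agreement: product of the two marginal proportions.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    # Gwet's chance agreement: based on the average marginal proportion.
    avg_props = [(count_a[c] + count_b[c]) / (2 * n) for c in categories]
    pe_ac1 = sum(p * (1 - p) for p in avg_props) / (q - 1)

    kappa = (p_o - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


# Hypothetical predictions (1 = positive, 0 = negative) on the same reviews.
bengali_preds = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
english_preds = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
kappa, ac1 = agreement_stats(bengali_preds, english_preds)
print(f"kappa={kappa:.3f}, AC1={ac1:.3f}")
```

With skewed class distributions such as this one, kappa tends to be noticeably lower than AC1 even when observed agreement is high, which is one common reason for reporting both.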

Summary

Introduction

Sentiment analysis classifies the semantic orientation of a text. Researchers have identified the sentiment orientation of text at various levels, such as document, sentence, or aspect, and have employed both machine learning-based and lexicon-based approaches. Using labeled data, supervised ML classifiers such as Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machines (SVM) (Pang et al., 2002; Gamon, 2004) and deep learning-based classifiers (Abdi et al., 2019; Araque et al., 2017) have been applied to sentiment classification. Although lexicon-based methods (Turney, 2002) do not require labeled data, they suffer from the lexicon coverage problem and are not robust to the ambiguity and linguistic variations of natural languages.
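
The lexicon coverage problem mentioned above can be seen in a toy sketch. The word lists and the function `lexicon_polarity` are hypothetical illustrations, not the lexicons or the method used in this work. The scorer simply counts matches against small positive and negative word sets.

```python
import string

# Hypothetical toy lexicons; real sentiment lexicons are far larger.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "boring", "terrible"}


def lexicon_polarity(text):
    """Label text by counting lexicon hits; 'unknown' when no word decides it."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"


print(lexicon_polarity("A great movie."))              # lexicon hit
print(lexicon_polarity("The plot was underwhelming."))  # no lexicon entry
```

The second review carries clear negative sentiment, but none of its words appear in the lexicon, so no labeled data is needed and yet no decision can be made.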

Methods

Results

Discussion

Conclusion