Abstract

This paper reports the zyy1510 team’s work in the International Workshop on Semantic Evaluation (SemEval-2020) shared task on Sentiment Analysis for Code-Mixed (Hindi-English, English-Spanish) Social Media Text. The task is to determine the polarity of a text, classifying it as positive, negative, or neutral. To this end, we propose an ensemble of a word n-gram-based Multinomial Naive Bayes (MNB) model and an LSTM over sub-word level representations (Sub-word LSTM) to identify the sentiment of Hindi-English and English-Spanish code-mixed data. The ensemble combines the rich sequential patterns and the intermediate features after convolution from the LSTM model with the polarity of keywords from the MNB model to obtain the final sentiment score. We tested our system on the Hindi-English and English-Spanish code-mixed social media data sets released for the task. Our model achieves an F1 score of 0.647 on the Hindi-English task and 0.682 on the English-Spanish task.

Highlights

  • Mixing languages, known as code-mixing, is a norm in multilingual societies.

  • Code-mixed social media texts generally take one of three forms: i) Mixed script: a combination of native and Roman scripts; ii) Code-mixed script: native-language and English text written in Roman script; iii) Native script: local languages written in their native scripts.

  • Beyond the challenges of general sentiment analysis, code-mixed texts pose additional, previously unseen difficulties for natural language processing (NLP) tasks.


Summary

Introduction

Mixing languages, known as code-mixing, is a norm in multilingual societies. Many multilingual speakers code-mix by using English-based expressions and inserting English into their main language (Patwa et al., 2020). They share their views on social media by combining local languages with English, producing large amounts of code-mixed text such as Hindi-English and English-Spanish (Ramanarayanan and Suendermann-Oeft, 2017). Code-mixed social media texts generally take one of three forms: i) Mixed script: a combination of native and Roman scripts; ii) Code-mixed script: native-language and English text written in Roman script; iii) Native script: local languages written in their native scripts. Such text differs substantially from conventional English text and needs to be handled differently (Prabhu et al., 2016). Beyond the challenges of general sentiment analysis, code-mixed texts pose additional, previously unseen difficulties for natural language processing (NLP) tasks. The implementation of our system is made available via GitHub.
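
To make the word n-gram MNB component and the ensembling idea from the abstract more concrete, the following is a minimal, standard-library-only sketch and not the released implementation. The names `word_ngrams`, `NgramMNB`, `ensemble`, and the weight `w_mnb` are hypothetical. The weighted average used to merge the two models is an assumption, since the exact combination rule is not stated here, and the Sub-word LSTM is represented only by a placeholder probability dictionary. The training sentences are toy examples invented for illustration.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_max=2):
    """Return word 1..n_max-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


class NgramMNB:
    """Multinomial Naive Bayes over word n-gram counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(word_ngrams(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict_proba(self, text):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label in self.class_counts:
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for feat in word_ngrams(text):
                if feat in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[feat] + self.alpha) / denom)
            log_scores[label] = score
        # Normalize the log scores into class probabilities (softmax).
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


def ensemble(p_mnb, p_lstm, w_mnb=0.5):
    """Weighted average of two class-probability dicts (assumed combination rule)."""
    return {k: w_mnb * p_mnb[k] + (1 - w_mnb) * p_lstm[k] for k in p_mnb}


# Toy Hindi-English examples, invented for illustration only.
train_texts = ["movie bahut accha tha", "yeh film bekar hai", "kal office jana hai"]
train_labels = ["positive", "negative", "neutral"]

mnb = NgramMNB().fit(train_texts, train_labels)
p_mnb = mnb.predict_proba("film bahut accha")

# Placeholder standing in for the Sub-word LSTM's output probabilities.
p_lstm = {"positive": 0.6, "negative": 0.3, "neutral": 0.1}

final = ensemble(p_mnb, p_lstm)
prediction = max(final, key=final.get)
print(prediction, final)
```

In this sketch the MNB contributes keyword-level polarity evidence through its n-gram likelihoods, while the second distribution stands in for the sequence model. Any weight, or a different rule such as stacking, could be substituted for the simple average.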
