Abstract
This paper reports the zyy1510 team's work in the International Workshop on Semantic Evaluation (SemEval-2020) shared task on Sentiment Analysis for Code-Mixed (Hindi-English, English-Spanish) Social Media Text. The task is to determine the polarity of a text by assigning it one of three labels: positive, negative, or neutral. To this end, we propose an ensemble of a word n-gram-based Multinomial Naive Bayes (MNB) model and an LSTM over sub-word level representations (Sub-word LSTM) to identify the sentiment of Hindi-English and English-Spanish code-mixed data. The ensemble combines the rich sequential patterns and intermediate convolutional features captured by the LSTM model with the keyword polarity captured by the MNB model to obtain the final sentiment score. We tested our system on the Hindi-English and English-Spanish code-mixed social media data sets released for the task. Our model achieves F1 scores of 0.647 on the Hindi-English task and 0.682 on the English-Spanish task.
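The MNB half of the ensemble and the final combination step can be sketched in plain Python. This is a minimal illustration, not the team's implementation: the class `WordNgramMNB`, the function `ensemble`, the toy training sentences, the add-one smoothing, the unigram-plus-bigram features, and the weighted averaging of class probabilities are all assumptions, since the abstract does not say how the two models' outputs are combined. The Sub-word LSTM is not implemented here; its output is represented by a placeholder probability dictionary.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_max=2):
    """Return word n-grams (n = 1..n_max) of a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


class WordNgramMNB:
    """Multinomial Naive Bayes over word n-gram counts (hypothetical helper)."""

    def __init__(self, alpha=1.0, n_max=2):
        self.alpha = alpha    # additive smoothing constant (assumed value)
        self.n_max = n_max    # longest n-gram used as a feature (assumed value)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(word_ngrams(text, self.n_max))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict_proba(self, text):
        # Features never seen in training are ignored.
        feats = [f for f in word_ngrams(text, self.n_max) if f in self.vocab]
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in feats:
                score += math.log((counts[f] + self.alpha) / denom)
            log_scores[label] = score
        # Normalise log scores into probabilities (numerically stable softmax).
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}


def ensemble(p_mnb, p_lstm, weight=0.5):
    """Weighted average of two class-probability dicts (assumed combination rule)."""
    combined = {l: weight * p_mnb[l] + (1 - weight) * p_lstm[l] for l in p_mnb}
    return max(combined, key=combined.get), combined


# Toy, made-up Hindi-English examples for illustration only.
train_texts = ["movie bahut acchi thi",
               "yeh phone bilkul bekar hai",
               "kal office jana hai"]
train_labels = ["positive", "negative", "neutral"]

model = WordNgramMNB().fit(train_texts, train_labels)
p_mnb = model.predict_proba("movie bahut acchi thi")

# Placeholder for the Sub-word LSTM output; a real system would compute this.
p_lstm = {"negative": 0.1, "neutral": 0.2, "positive": 0.7}

label, combined = ensemble(p_mnb, p_lstm)
print(label, combined)
```

The `weight` parameter controls how much the keyword-driven MNB prediction counts relative to the sequence model; in practice it would be tuned on held-out data.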
Highlights
Mixing languages, known as code-mixing, is the norm in multilingual societies
Code-mixed social media texts generally take one of three forms: i) Mixed script: a combination of native and Roman scripts; ii) Code-mixed script: native and English languages both written in Roman script; iii) Native script: local languages written in their native script
Beyond the challenges of general sentiment analysis, code-mixed texts pose additional, previously unseen difficulties for natural language processing (NLP) tasks
Summary
Mixing languages, known as code-mixing, is the norm in multilingual societies. Many multilingual people code-mix by using English-based speech forms and inserting English into their main language (Patwa et al., 2020). They share their views on social media by combining local languages and English, creating large amounts of code-mixed text such as Hindi-English and English-Spanish (Ramanarayanan and Suendermann-Oeft, 2017). Code-mixed social media texts generally take one of three forms: i) Mixed script: a combination of native and Roman scripts; ii) Code-mixed script: native and English languages both written in Roman script; iii) Native script: local languages written in their native script. This type of text differs substantially from traditional English text and needs to be handled differently (Prabhu et al., 2016). Beyond the challenges of general sentiment analysis, code-mixed texts pose additional, previously unseen difficulties for natural language processing (NLP) tasks. The implementation of our system is available on GitHub.