Abstract

Background: Evaluating the sentiment of tweets, blogs, comments and posts has become a crucial part of many applications. Sentiment analysis of social network data supports decision-making in application areas such as movie reviews, product feedback and the impact of a politician's speech. Users often comment in their native languages or in slang, frequently use abbreviations, and do not always follow the grammatical rules of the language. Bilingual and multilingual communities often mix two or more languages in their comments. The unavailability of annotated code-mixed data for native languages adds to the difficulty of performing sentiment analysis.
Objectives: This article presents the process of creating an annotated corpus for code-mixed social media text in Hindi, English and Hinglish, collected from Twitter. Words with ambiguous meanings and inconsistent spellings in both languages are included in the study to provide a broad canvas.
Method: This study identifies significant elements that should be considered while developing an annotated corpus from a Hindi, Hinglish and English dataset. Annotation is based on the polarity of words in three categories: positive, negative and neutral. Some words carry mixed feelings, i.e. they have both positive and negative sentiments; to handle these words, inner agreement among the polarities has been considered. Words used for sarcasm or as slang have also been taken into account, as have words with ambiguous meanings and inconsistent spellings in both languages.
Findings: The proposed work provides a standard annotated corpus for code-switched social media text in Hindi-English (Hinglish). The process of developing the corpus and calculating the polarity is shown. It is found that accuracy can be improved when code-mixed text is taken into account.
Application: The proposed corpus can be used in areas such as market analysis, customer behavior, polling analysis and brand monitoring. The corpus serves as a dataset that can be extended further according to the problem definition.
Keywords: Machine learning; sentiment analysis; data preprocessing; data cleaning; code-switch; linguistic-switching; multilingual

Highlights

  • With the growing use of the digital world, users can access data on devices ranging from large-screen to small-screen terminals [1], [2]

  • This paper focuses on words with conflicting polarity, handled through inner annotation agreement, and on the properties and statistics of the dataset

  • This study presents a method for developing a corpus of Hindi and Hinglish sentiment words
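
The idea of labelling words as positive, negative or neutral and checking agreement on conflicting words can be sketched roughly as follows. This is only an illustrative sketch, not the authors' procedure: the majority-vote rule, the `conflict` marker, the function names and the sample annotations are all assumptions introduced here.

```python
from collections import Counter

# The three polarity categories named in the abstract.
LABELS = ("positive", "negative", "neutral")


def resolve_label(annotations):
    """Return the majority label, or 'conflict' when the top labels tie.

    Majority voting is an assumption made for this sketch; the source
    does not state which agreement rule was used.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    ranked = Counter(annotations).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "conflict"  # hypothetical marker for mixed-feeling words
    return ranked[0][0]


def agreement(annotations):
    """Fraction of annotations that match the most common label."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    top_count = Counter(annotations).most_common(1)[0][1]
    return top_count / len(annotations)


# Hypothetical annotations, invented purely for illustration.
word_annotations = {
    "accha": ["positive", "positive", "positive"],
    "bekar": ["negative", "negative", "neutral"],
    "sick": ["positive", "negative"],
}

# Map each word to (resolved label, agreement score).
corpus = {
    word: (resolve_label(votes), agreement(votes))
    for word, votes in word_annotations.items()
}

for word, (label, score) in corpus.items():
    print(f"{word}: {label} (agreement {score:.2f})")
```

Words that end up marked `conflict` would be the ones to review by hand or to resolve with a stricter agreement rule.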


Summary

Introduction

With the growing use of the digital world, users can access data on devices ranging from large-screen to small-screen terminals [1], [2]. On micro-blogging sites such as Twitter and Instagram, users can freely share their thoughts and views on real-world activities: popular political events such as elections, corruption, a celebrity performing at a charitable event or supporting a political party, or anything else that attracts their attention. Social networking communities replicate real-world communities, and events on social networking sites replicate real-world events. The increasing use of social networking sites has given rise to several important aspects of research. Sentiment analysis of social network data supports decision-making in application areas such as movie reviews, product feedback and the impact of a politician's speech. The unavailability of annotated code-mixed data for native languages adds to the difficulty of performing sentiment analysis.

