Code-mixed Text Research Articles

Background: Evaluating the sentiment of tweets, blogs, comments and posts has become a crucial part of many applications. Sentiment analysis of social network data is very helpful for decision-making in application areas such as movie reviews, product feedback and the impact of a politician's speech. Users often comment in their native languages or in slang, frequently use abbreviations, and do not always follow the grammatical rules of the language. Bilingual and multilingual communities often mix two or more languages in their comments. The unavailability of annotated code-mixed data for native languages adds to the difficulty of performing sentiment analysis.

Objectives: The aim of this article is to present the process of creating an annotated corpus for code-mixed social media text in Hindi, English and Hinglish collected from Twitter. Words with ambiguous meanings and inconsistent spellings in both languages have also been included in the study to give it wide coverage.

Method: This study presents the significant elements that should be considered while developing an annotated corpus for a Hindi, Hinglish and English dataset. The annotation is based on the polarity of words in three categories: positive, negative and neutral. Some words carry mixed feelings, i.e. they have positive as well as negative sentiments. To account for these words, inner agreement among the polarities has been considered. Words used for sarcasm and slang have also been taken into account, as have words with ambiguous meanings and inconsistent spellings in both languages.

Findings: The proposed work provides a standard annotated corpus for code-switched social media text in Hindi-English (Hinglish). The process of developing the corpus and calculating the polarity is shown. It is found that accuracy can be enhanced if code-mixed text is taken into account.

Application: The proposed corpus can be utilized in areas such as market analysis, customer behavior, polling analysis and brand monitoring. The corpus serves as a dataset that can be extended further according to the problem definition.

Keywords: Machine learning; sentiment analysis; data preprocessing; data cleaning; code-switch; linguistic-switching; multilingual
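
As a rough illustration of word-level polarity in three categories, the sketch below labels a code-mixed sentence by counting positive and negative words. It is not the authors' annotation procedure: the lexicon, the spelling-variant table and the function names are all hypothetical, and the sketch does not model agreement among polarities, sarcasm or ambiguous words.

```python
from collections import Counter

# Hypothetical toy lexicon of English and romanized Hindi words.
# The entries are illustrative assumptions, not taken from the corpus.
LEXICON = {
    "good": "positive",
    "accha": "positive",
    "mast": "positive",
    "bad": "negative",
    "bekar": "negative",
    "ganda": "negative",
}

# Hypothetical table mapping inconsistent spellings to one canonical form.
VARIANTS = {"acha": "accha", "achha": "accha", "bekaar": "bekar"}


def word_polarity(word: str) -> str:
    """Return 'positive', 'negative' or 'neutral' for a single word."""
    token = word.lower()
    token = VARIANTS.get(token, token)  # normalize known spelling variants
    return LEXICON.get(token, "neutral")  # unknown words count as neutral


def sentence_polarity(sentence: str) -> str:
    """Label a sentence by comparing its positive and negative word counts."""
    counts = Counter(word_polarity(w) for w in sentence.split())
    if counts["positive"] > counts["negative"]:
        return "positive"
    if counts["negative"] > counts["positive"]:
        return "negative"
    return "neutral"  # a tie, or no polar words at all


print(sentence_polarity("movie bahut achha tha"))
print(sentence_polarity("service bekaar hai"))
```

Treating a tie as neutral is a simplification made here for brevity; words with mixed polarity would need a separate treatment in a real annotation scheme.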


Purpose: Normalization is an important step in all natural language processing applications that handle social media text. Text from social media poses problems that are not present in regular text. Recently, a considerable amount of work has been done in this direction, but mostly for the English language. People who do not speak English code-mix the text with their native language and post it on social media using the Roman script. This kind of text further aggravates the problem of normalization. This paper aims to discuss the concept of normalization with respect to code-mixed social media text and proposes a model to normalize such text.

Design/methodology/approach: The system is divided into two phases: candidate generation and most probable sentence selection. The candidate generation task is treated as a machine translation task in which the Roman text is the source language and the Gurmukhi text is the target language. A character-based translation system is proposed to generate candidate tokens. Once candidates are generated, the second phase uses the beam search method to select the most probable sentence based on a hidden Markov model.

Findings: Character error rate (CER) and bilingual evaluation understudy (BLEU) score are reported. The proposed system has been compared with Akhar software and the RB_R2G system, which are also capable of transliterating Roman text to Gurmukhi. The proposed system outperforms Akhar software. The CER and BLEU scores are 0.268121 and 0.6807939, respectively, for ill-formed text.

Research limitations/implications: It was observed that the system produces dialectal variations of a word, or the word with minor errors such as a missing diacritic. A spell checker can improve the output of the system by correcting these minor errors. Extensive experimentation is needed to optimize the language identifier, which will further help in improving the output. The language model also needs further exploration. Inclusion of wider context, particularly from social media text, is an important area that deserves further investigation.

Practical implications: The practical implications of this study are: (1) development of a parallel dataset containing Roman and Gurmukhi text; (2) development of a dataset annotated with language tags; (3) development of the normalizing system, which is the first of its kind and proposes a translation-based solution for normalizing noisy social media text from Roman to Gurmukhi, and which can be extended to any pair of scripts; (4) the proposed system can be used for better analysis of social media text. Theoretically, the study helps in better understanding text normalization in the social media context and opens the door for further research in multilingual social media text normalization.

Originality/value: Existing research focuses on normalizing monolingual text. This study contributes towards the development of a normalization system for multilingual text.
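
The second phase described above, selecting the most probable sentence from per-token candidates, can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: it assumes each position comes with candidate tokens and a candidate score (used here like an emission probability), that a bigram table stands in for the hidden Markov model's transition probabilities, and that unseen bigrams receive a small floor value. The function name, the placeholder tokens and all probabilities are hypothetical.

```python
import math


def beam_search(candidates, transition, beam_width=2, floor=1e-6):
    """Pick a high-probability token sequence with beam search.

    candidates: list with one dict per position, {token: candidate probability}.
    transition: dict {(previous_token, token): probability}; "<s>" marks the start.
    floor: assumed fallback probability for bigrams missing from the table.
    """
    beams = [(0.0, ["<s>"])]  # each beam is (log-probability, path so far)
    for options in candidates:
        expanded = []
        for score, path in beams:
            for token, emit in options.items():
                trans = transition.get((path[-1], token), floor)
                new_score = score + math.log(emit) + math.log(trans)
                expanded.append((new_score, path + [token]))
        # Keep only the best `beam_width` partial sentences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    best_score, best_path = beams[0]
    return best_path[1:], best_score  # drop the "<s>" marker


# Hypothetical candidates for a two-token sentence (placeholder token names).
candidates = [{"a1": 0.6, "a2": 0.4}, {"b1": 0.5, "b2": 0.5}]
transition = {
    ("<s>", "a1"): 0.5, ("<s>", "a2"): 0.5,
    ("a1", "b1"): 0.1, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.9, ("a2", "b2"): 0.05,
}

print(beam_search(candidates, transition, beam_width=2)[0])
print(beam_search(candidates, transition, beam_width=1)[0])
```

With this toy data a beam width of 1 commits to the locally best first token and ends on a less probable sentence than a beam width of 2, which is the usual reason for keeping more than one hypothesis alive.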


Articles published on Code-mixed Text

Hatred and trolling detection transliteration framework using hierarchical LSTM in code-mixed social media text

Characterization and mechanical properties of offensive language taxonomy and detection techniques

Zero-Shot Sentiment Analysis for Code-Mixed Data

A Pun Identification Framework for Retrieving Equivocation Terms based on HLSTM Learning Model

Neural Network Pun Material Identification Framework based on Artificial Intelligence Learning

Named Entity Recognition for Code Mixed Social Media Sentences

Normalisation of Indonesian-English Code-Mixed Text and its Effect on Emotion Classification

Word Level Language Identification on Code-Mixed English-Bodo Text

Transformer Based Language Identification for Malayalam-English Code-Mixed Text

Distributional Word Representations for Code-mixed Text in Moroccan Darija

An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed social Media Text in English and Roman Hindi

Annotated corpus creation for sentiment analysis in code-mixed Hindi-English (Hinglish) social network data

Roman to Gurmukhi Social Media Text Normalization

Deep Learning Based Sentiment Analysis in a Code-Mixed English-Hindi and English-Bengali Social Media Corpus

Improving Code-mixed POS Tagging Using Code-mixed Embeddings

Language identification framework in code-mixed social media text based on quantum LSTM — the word belongs to which language?

Current State of Hinglish Text Sentiment Analysis

Detection of Hate Speech Text in Hindi-English Code-mixed Data

Emotion Detection in Hinglish(Hindi+English) Code-Mixed Social Media Text

An effective cybernated word embedding system for analysis and language identification in code-mixed social media text
