Code-mixed Text Research Articles

People in the modern digital era increasingly use social media platforms to express their concerns and emotions in the form of reviews or comments. While positive interactions within diverse communities can considerably enhance confidence, negative comments can hurt people’s reputations and well-being. Individuals now tend to express their thoughts in their native languages on these platforms, which makes automated analysis challenging because of potential syntactic ambiguity in these languages. Most research has been conducted on resource-rich languages like English. However, low-resource languages such as Urdu, Arabic, and Hindi present challenges due to limited linguistic resources, making information extraction labor-intensive. This study concentrates on code-mixed text of three types: English, Roman Urdu, and their combination. It introduces robust transformer-based models to enhance sentiment prediction in code-mixed text that combines Roman Urdu and English in the same context. Unlike conventional deep learning-based models, transformers are adept at handling syntactic ambiguity, facilitating the interpretation of semantics across various languages. We used state-of-the-art transformer-based models, namely Electra, code-mixed BERT (cm-BERT), and Multilingual Bidirectional and Auto-Regressive Transformers (mBART), to address sentiment prediction challenges in code-mixed tweets. The results reveal that mBART outperformed the Electra and cm-BERT models for sentiment prediction in code-mixed text, with an overall F1-score of 0.73. We also perform topic modeling to uncover shared characteristics within the corpus and reveal patterns and commonalities across different classes.

Due to the increasing reliance on social network platforms in recent years, hate speech has risen significantly among online users. Governments and social media platforms face the challenging responsibility of controlling, detecting, and removing massively growing hateful content as early as possible to prevent future criminal acts, such as cyberviolence and real-life hate crimes. Twitter is used globally by people from various backgrounds and nationalities; it contains tweets posted in different languages, including code-mixed language, such as Hindi–English. Due to the informal format of tweets, with variations in spelling and grammar, hate speech detection is especially challenging in code-mixed text. In this paper, we tackle the critical issue of hate speech detection on social media, with a focus on a mix of English and Hindi–English (code-mixed) text messages on Twitter. More specifically, we aim to evaluate the impact of data pre-processing on hate speech detection. Our method first performs 10-step data cleansing; then, it builds a detection method based on two architectures, namely a convolutional neural network (CNN) and a combination of CNN and long short-term memory (LSTM) algorithms. We tune the hyperparameters of the proposed model architectures and conduct extensive experimental analysis on real-life tweets to evaluate the performance of the models in terms of accuracy, efficiency, and scalability. Moreover, we compare our method with a closely related hate speech detection method from the literature. The experimental results suggest that our method achieves improved accuracy and a significantly improved runtime. Among our best-performing models, CNN-LSTM improved accuracy by nearly 2% and decreased the runtime by almost half.
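
The abstract above does not list its ten cleansing steps, so the following is only a minimal sketch of the kind of tweet cleaning that is commonly applied before training a classifier on code-mixed text. The function name `clean_tweet`, the choice of steps, and their order are assumptions made for illustration; they are not the authors' pipeline.

```python
import re
import string

# Illustrative patterns only; the paper's actual ten steps are not given in the abstract.
RETWEET_RE = re.compile(r"^\s*rt\b[\s:]*", flags=re.IGNORECASE)  # leading "RT" marker
URL_RE = re.compile(r"https?://\S+|www\.\S+")                    # links
MENTION_RE = re.compile(r"@\w+")                                 # @user handles
REPEAT_RE = re.compile(r"(.)\1{2,}")                             # 3+ repeats of one character
# Map ASCII punctuation to spaces; non-ASCII characters (e.g. Devanagari) are left untouched.
PUNCT_TABLE = str.maketrans({c: " " for c in string.punctuation})


def clean_tweet(text: str) -> str:
    """Apply a small, assumed sequence of cleaning steps to one tweet."""
    text = RETWEET_RE.sub("", text)       # drop the retweet marker
    text = URL_RE.sub(" ", text)          # drop URLs
    text = MENTION_RE.sub(" ", text)      # drop user mentions
    text = text.replace("#", " ")         # keep the hashtag word, drop the symbol
    text = text.lower()                   # case-fold
    text = REPEAT_RE.sub(r"\1\1", text)   # squeeze elongations to two characters
    text = text.translate(PUNCT_TABLE)    # strip ASCII punctuation
    return " ".join(text.split())         # collapse whitespace


if __name__ == "__main__":
    sample = "RT @user: Yeh movie bahut achhiiii thi!!! https://t.co/abc #MustWatch"
    print(clean_tweet(sample))
```

Squeezing elongated spellings to two characters rather than one is a deliberate choice in this sketch: romanized Hindi has no standard spelling, so collapsing fully could merge words that legitimately contain a doubled letter.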


Articles published on Code-mixed Text

A survey on NLP tasks, resources and techniques for low-resource Telugu-English code-mixed text

Use of prompt-based learning for code-mixed and code-switched text classification

Predicting multi-label emojis, emotions, and sentiments in code-mixed texts using an emojifying sentiments framework

Abusive Comment Detection in Tamil Code-Mixed Data by Adjusting Class Weights and Refining Features

Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data

Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text

Augmenting sentiment prediction capabilities for code-mixed tweets with multilingual transformers

Consensus-Based Machine Translation for Code-Mixed Texts

Sentiment Analysis of Code-Mixed Text: A Comprehensive Review

Language Identification and Transliteration approaches for Code-Mixed Text

Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning

Word Level Language Identification in Indonesian-Javanese-English Code-Mixed Text

English to Arabic Braille Neural Machine Translation Through Corpus Augmentation

Sarcasm Detection in Indonesian-English Code-Mixed Text Using Multihead Attention-Based Convolutional and Bi-Directional GRU

Analysing Code-Mixed Text in Programming Instruction Through Machine Learning for Feature Extraction

A feature fusion and detection approach using deep learning for sentimental analysis and offensive text detection from code-mix Malayalam language

Language augmentation approach for code-mixed text classification

AdapterFusion-based multi-task learning for code-mixed and code-switched text classification

The Impact of Data Pre-Processing on Hate Speech Detection in a Mix of English and Hindi–English (Code-Mixed) Tweets

Sentimental analysis & Hate speech detection on English and German text collected from social media platforms using optimal feature extraction and hybrid diagonal gated recurrent neural network
