Code-switched Text Research Articles

Cross-genre author profiling aims to build generalized models that predict authors' profile traits across different text genres, for use in computer forensics, marketing, and other applications. The task becomes challenging for low-resource languages because standard corpora and methods are lacking. It becomes even more challenging when the data is code-switched, which makes it informal and unstructured. Previous studies have mainly explored cross-genre author profiling for monolingual texts in high-resource languages (English, Spanish, etc.). However, it has not been thoroughly explored for code-switched text, which is widely used for communication over social media. To fill this gap, we propose a transfer learning-based solution for the cross-genre author profiling task on code-switched (English–RomanUrdu) text using three widely known genres: Facebook comments/posts, Tweets, and SMS messages. In this article, we first experiment with traditional machine learning, deep learning, and pre-trained transfer learning models (mBERT, XLM-RoBERTa, ULMFiT, and XLNet) on the same-genre and cross-genre gender identification task. We then propose a novel Trans-Switch approach that focuses on the code-switching nature of the text and trains on specialized language models. In addition, we developed three RomanUrdu-to-English translated corpora to study the impact of translation on author profiling tasks. The results show that the proposed Trans-Switch model outperforms the baseline deep learning and pre-trained transfer learning models on the cross-genre author profiling task for code-switched text. The experiments also show that translating RomanUrdu text does not improve results.

User-generated text in social media communication (SMC) is mainly characterized by its non-standard form. It may contain code-switching (CS) text, a widespread phenomenon in SMC, as well as noisy elements common in written conversations (abbreviations, symbols, emoticons) and misspelled words. All of these factors pose a barrier to text mining applications. Common text mining tools are designed for standard use of standard languages and cannot deal with other forms, especially written text in social media. To overcome these problems, we present a solution for normalizing non-standard use of standard and non-standard languages (dialects) in SMC text using existing resources and tools. The main processing step in our solution is CS normalization from multiple languages to one language using a machine-translation-like approach. This step relies on a linguistic approach to CS that automatically identifies the translation source and target languages, without human intervention. The remaining processing operations concern the normalization of SMC special expressions and the spelling correction of out-of-vocabulary words. To preserve the meaning of the code-switched sentence across translation, we adopt a knowledge-based approach to word sense translation disambiguation, reinforced with a multilingual vertical context. All of these processes are embedded in what we refer to as the machine normalization system. Our solution can be used as a front end to text mining processing, enabling the analysis of noisy SMC text. The conducted experiments show that our system performs better than the considered baselines.
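
The two secondary steps named in this abstract (normalizing SMC special expressions and correcting out-of-vocabulary words) can be illustrated with a minimal Python sketch. It is not the authors' system: the lookup table, the vocabulary, the function names, and the use of `difflib` for spelling correction are all assumptions made for illustration, and the CS translation step is left out because it depends on bilingual resources the abstract does not specify.

```python
import difflib
import re

# Hypothetical toy resources; a real system would rely on full lexicons.
SMC_EXPRESSIONS = {"u": "you", "gr8": "great", "thx": "thanks"}
VOCABULARY = ["you", "are", "great", "thanks", "tomorrow", "see", "friend"]


def normalize_smc_expressions(tokens):
    """Replace known SMC abbreviations with their standard forms."""
    return [SMC_EXPRESSIONS.get(token, token) for token in tokens]


def correct_spelling(tokens, vocabulary=VOCABULARY, cutoff=0.8):
    """Map each out-of-vocabulary token to its closest vocabulary word.

    Tokens with no sufficiently similar vocabulary word are kept unchanged.
    """
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


def normalize(text):
    """Lowercase and tokenize, then apply the two normalization steps in order."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = normalize_smc_expressions(tokens)
    tokens = correct_spelling(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("thx u are gr8"))
    print(normalize("see u tomorow"))
```

Running the expression lookup before spelling correction keeps short abbreviations such as "u" from being matched against unrelated vocabulary words; that ordering is a design choice of this sketch, not something the abstract states.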

Articles published on Code-switched Text

A Comparative Study of Transformer-based Models for Hate-Speech Detection in English-Kiswahili Code-Switched Social Media Text

Use of prompt-based learning for code-mixed and code-switched text classification

A novel socio-pragmatic framework for sentiment analysis in Dravidian–English code-switched texts

Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data

AdapterFusion-based multi-task learning for code-mixed and code-switched text classification

The Analysis of the Sepedi-English Code-switched Radio News Corpus

Tran-Switch: A transfer learning approach for sentence level cross-genre author profiling on code-switched English–RomanUrdu Text

Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification

Code-switched end-to-end Marathi speech recognition for especially abled people

COVID-19 and cyberbullying: deep ensemble model to identify cyberbullying from code-switched languages during the pandemic

Psychosocial Features for Hate Speech Detection in Code-switched Texts

Leveraging bilingual-view parallel translation for code-switched emotion detection with adversarial dual-channel encoder

Novel textual features for language modeling of intra-sentential code-switching data

Machine Normalization

Fine-Tuning BERT for Multi-Label Sentiment Analysis in Unbalanced Code-Switching Text

Predicting the emergence of content words in L2 diary entries during study abroad over a year

GIRNet: Interleaved Multi-Task Recurrent State Sequence Models

IITG-HingCoS corpus: A Hinglish code-switching database for automatic speech recognition

Emotion Analysis in Code-Switching Text With Joint Factor Graph Model

Syntactic and Semantic Features For Code-Switching Factored Language Models
