Abstract
Code-mixing is a common phenomenon in multilingual societies around the world and is especially prevalent in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks since we can use transfer learning from state-of-the-art English models for processing the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark - Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance on both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains in languages that were not in its pre-training corpus.
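As a minimal sketch of the transfer-learning step described above, the snippet below applies publicly available English models fine-tuned on similar tasks (sentiment and NLI) to text that has already been translated to English. The checkpoint names are illustrative public stand-ins, not the paper's exact models.

# A minimal sketch, assuming the code-mixed input has already been
# translated to English (see the mBART sketch in the Summary below).
# Checkpoints are illustrative public models, not the paper's exact ones.
from transformers import pipeline

# English sentiment model fine-tuned on SST-2.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
# English NLI model fine-tuned on MNLI.
nli = pipeline("text-classification", model="roberta-large-mnli")

translated = "The movie was really good, I totally loved it."
print(sentiment(translated))  # e.g. [{'label': 'POSITIVE', 'score': ...}]
print(nli({"text": translated,                       # premise
           "text_pair": "The movie was enjoyable."}))  # hypothesis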
Highlights
Most research uses informal sources such as social media texts or messages, which are usually challenging to obtain
Individuals generally provide a rough phonetic transcription of the intended word, which can vary from individual to individual due to any number of factors
In contrast, neural machine translation gained popularity in the last decade after Kalchbrenner and Blunsom (2013) successfully proposed the first DNN model for translation
We discuss work related to code-mixed language processing, machine translation, Natural Language Inference, and Sentiment Analysis
We present our approach for code-mixed sequence-level classification tasks, evaluate it on the chosen tasks - Natural Language Inference and Sentiment Analysis - and show its performance against past work
We achieve state-of-the-art performance on two classification tasks of the GLUECoS benchmark - Natural Language Inference and Sentiment Analysis - with an absolute increase over previous results
Summary
We describe our proposed model, which uses mBART (Liu et al., 2020) to translate code-mixed texts to English. We fine-tune mBART, a multilingual sequence-to-sequence denoising auto-encoder pre-trained using the BART (Lewis et al., 2020) objective on large-scale monolingual corpora of 25 languages extracted from Common Crawl (Wenzek et al., 2020; Conneau et al., 2020). There has been extensive research on sentiment analysis of English texts through various shared tasks. We use datasets including that of Srivastava and Singh (2020); the statistics of the datasets are provided in Table 1. Since both datasets contain Hindi words in Roman script, we use the CSNLI library (Bhat et al., 2017, 2018) as a preprocessing step. We use the training set, which contains 1,609,682 sentences, for training our systems
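As a concrete sketch of the translation step, the snippet below loads the public pre-trained mBART-25 checkpoint via Hugging Face Transformers and forces English-language generation. The paper fine-tunes mBART on parallel data before use, which this sketch does not reproduce, and the example input assumes CSNLI has already back-transliterated Roman-script Hindi to Devanagari.

# A minimal sketch of mBART translation to English, assuming the input has
# been normalized by CSNLI. The paper fine-tunes this checkpoint on parallel
# data first; this sketch uses the raw pre-trained model for illustration.
from transformers import MBartForConditionalGeneration, MBartTokenizer

model_name = "facebook/mbart-large-cc25"  # pre-trained mBART, 25 languages
tokenizer = MBartTokenizer.from_pretrained(model_name,
                                           src_lang="hi_IN", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Code-mixed input after CSNLI preprocessing (Roman-script Hindi restored
# to Devanagari, English words left as-is).
batch = tokenizer(["यह movie बहुत अच्छी थी"], return_tensors="pt")
generated = model.generate(
    **batch,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],  # force English
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))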