Abstract

The use of code-mixed language (written in Roman script) on social media platforms is prevalent in multilingual nations. Translation from code-mixed to monolingual text is necessary for social media analysis, content filtering, and targeted advertising. Training translation models from scratch is difficult due to the scarcity of available code-mixed resources and the extremely noisy nature of real-time code-mixed sentences. Multilingual state-of-the-art language models are now routinely used for multilingual applications. However, such models are ineffective at handling code-mixed sentences, which are usually written in Roman script yet contain words from at least two languages. In this paper, two data augmentation techniques are proposed to improve code-mixed to monolingual translation: one based on script augmentation and the other on code-mixed sentence generation. The proposed approach converts code-mixed sentences into a ‘Mixed Script form’ that restores the native-language words in each sentence to their corresponding native scripts. The novelty of the work is that the multilingual language models can then draw on each language’s linguistic competence, preserving context in the monolingual sentences, which was not possible in earlier models. Using an mT5 model, denoising and mixed-script switching are performed, followed by monolingual translation with another mT5 model. Code-mixed sentences are generated with a simple technique that uses monolingual parallel inputs. Two Indic language pairs, namely Hindi-English and Bengali-English, are evaluated, and in each case the proposed approach outperforms direct uni-script (Roman) code-mixed to monolingual translation.
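The code-mixed sentence generation from monolingual parallel inputs mentioned above can be illustrated with a minimal sketch. This is not the paper's actual method; it is a hypothetical, simplified generator that assumes a 1:1 word alignment between the two parallel sentences and randomly switches each position between the two languages:

```python
import random

def generate_code_mixed(src_tokens, tgt_tokens, switch_prob=0.4, seed=0):
    """Toy code-mixed sentence generator (illustrative only).

    Assumes a 1:1 word alignment between the two monolingual
    parallel sentences; each position keeps the source-language
    word or switches to the aligned target-language word with
    probability `switch_prob`. Real generators use learned word
    alignments and linguistic switching constraints.
    """
    rng = random.Random(seed)
    mixed = []
    for src, tgt in zip(src_tokens, tgt_tokens):
        mixed.append(tgt if rng.random() < switch_prob else src)
    return " ".join(mixed)

# Parallel romanized-Hindi / English tokens (hypothetical example,
# aligned word-for-word purely for illustration)
hindi = ["main", "kal", "bazaar", "gaya"]
english = ["I", "yesterday", "market", "went"]
print(generate_code_mixed(hindi, english, switch_prob=0.5, seed=1))
```

Pairing each generated code-mixed sentence with its monolingual source then yields synthetic parallel training data for the translation model, which is the general idea behind this class of augmentation.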
