Abstract

The use of code-mixed language (written in Roman script) on social media platforms is prevalent in multilingual nations. Translation from code-mixed to monolingual text is necessary for social media analysis, content filtering, and targeted advertising. Training translation models from scratch is difficult due to the scarcity of available code-mixed resources and the extremely noisy nature of real-time code-mixed sentences. Multilingual state-of-the-art language models are now routinely used for multilingual applications. However, such models are ineffective at handling code-mixed sentences, which are usually written in Roman script yet contain words from at least two languages. In this paper, two data augmentation techniques are proposed to improve code-mixed to monolingual translation: one based on script augmentation and the other on code-mixed sentence generation. The proposed approach converts code-mixed sentences into a ‘Mixed Script form’ that restores the native-language words in each sentence to their corresponding native scripts. The novelty of the work is that the multilingual language models can then draw on each language’s linguistic competence, preserving context in the monolingual sentences, which was not possible in earlier models. Using an mT5 model, denoising and mixed-script switching are performed, followed by monolingual translation with another mT5 model. Code-mixed sentences are generated with a simple technique that uses monolingual parallel inputs. Two Indic language pairs, namely Hindi-English and Bengali-English, are evaluated, and in each case the proposed approach outperforms direct uni-script (Roman) code-mixed to monolingual translation.
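The code-mixed sentence generation from monolingual parallel inputs mentioned above can be illustrated with a minimal sketch. This is not the paper's actual method; it is a hypothetical, simplified generator that assumes a 1:1 word alignment between the two parallel sentences and randomly switches each position between the two languages:

```python
import random

def generate_code_mixed(src_tokens, tgt_tokens, switch_prob=0.4, seed=0):
    """Toy code-mixed sentence generator (illustrative only).

    Assumes a 1:1 word alignment between the two monolingual
    parallel sentences; each position keeps the source-language
    word or switches to the aligned target-language word with
    probability `switch_prob`. Real generators use learned word
    alignments and linguistic switching constraints.
    """
    rng = random.Random(seed)
    mixed = []
    for src, tgt in zip(src_tokens, tgt_tokens):
        mixed.append(tgt if rng.random() < switch_prob else src)
    return " ".join(mixed)

# Parallel romanized-Hindi / English tokens (hypothetical example,
# aligned word-for-word purely for illustration)
hindi = ["main", "kal", "bazaar", "gaya"]
english = ["I", "yesterday", "market", "went"]
print(generate_code_mixed(hindi, english, switch_prob=0.5, seed=1))
```

Pairing each generated code-mixed sentence with its monolingual source then yields synthetic parallel training data for the translation model, which is the general idea behind this class of augmentation.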
