Abstract
This paper describes the system submitted by the IITP-MT team to the Computational Approaches to Linguistic Code-Switching (CALCS 2021) shared task on MT for English→Hinglish. We submit a neural machine translation (NMT) system trained on a synthetic code-mixed (cm) English-Hinglish parallel corpus. We propose an approach to create a code-mixed parallel corpus from a clean parallel corpus in an unsupervised manner. It is an alignment-based approach, and we do not use any linguistic resources to explicitly mark tokens for code-switching. We also train an NMT model on the gold corpus provided by the workshop organizers, augmented with the generated synthetic code-mixed parallel corpus. The model trained on the generated synthetic cm data achieves 10.09 BLEU points on the given test set.
Highlights
We introduce an alignment-based unsupervised approach for generating code-mixed data from a parallel corpus, which can be used to train the neural machine translation (NMT) model for code-mixed text translation.
We describe our submission to the shared task on Machine Translation (MT) for English → Hinglish at CALCS 2021.
We submit an NMT system trained on the synthetic parallel code-mixed English-Hinglish corpus.
In Section 3, we describe our approach to generating the synthetic code-mixed corpus.
Summary
We introduce an alignment-based unsupervised approach for generating code-mixed data from a parallel corpus. In this task, we submit an NMT system trained on the synthetic parallel code-mixed English-Hinglish corpus. In Section 3, we describe our approach to generating the synthetic code-mixed corpus. Several earlier works also address the generation of synthetic code-mixed corpora. Gupta et al. (2020) proposed an encoder-decoder based model that takes an English sentence along with linguistic features as input and generates a synthetic code-mixed sentence. Pratapa et al. (2018) explored the ‘Equivalence Constraint’ theory to generate synthetic code-mixed data, which is used to improve the performance of a Recurrent Neural Network (RNN) based language model. While Winata et al. (2019) proposed a method to generate code-mixed data using a pointer-generator network, Garg et al. (2018) explored SeqGAN for code-mixed data generation.
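To make the alignment-based idea concrete, the following is a minimal Python sketch of one way a code-mixed target sentence could be derived from a parallel sentence pair and its word alignments. It is an illustration under stated assumptions, not the procedure of the submitted system: the function name `code_mix`, the `switch_prob` parameter, the restriction to one-to-one alignment links, and the romanized Hindi tokens in the example are all hypothetical choices made here. The word alignments are assumed to come from an external unsupervised word aligner and are passed in as index pairs.

```python
import random


def code_mix(src_tokens, tgt_tokens, alignment, switch_prob=0.5, seed=0):
    """Build a code-mixed sentence by swapping aligned target tokens
    for their source-language (English) counterparts.

    alignment: list of (src_idx, tgt_idx) pairs from a word aligner.
    Assumption of this sketch: only one-to-one links are switch candidates,
    and each candidate is switched with probability `switch_prob`.
    """
    rng = random.Random(seed)

    # Count how often each source and target position appears in a link.
    src_count, tgt_count = {}, {}
    for s, t in alignment:
        src_count[s] = src_count.get(s, 0) + 1
        tgt_count[t] = tgt_count.get(t, 0) + 1

    mixed = list(tgt_tokens)
    for s, t in alignment:
        one_to_one = src_count[s] == 1 and tgt_count[t] == 1
        if one_to_one and rng.random() < switch_prob:
            # Replace the target token with the aligned English token.
            mixed[t] = src_tokens[s]
    return mixed


if __name__ == "__main__":
    # Hypothetical example pair; Hindi is shown romanized for readability.
    english = "I am going to the market".split()
    hindi = "main bazaar ja raha hoon".split()
    links = [(0, 0), (5, 1), (2, 2), (2, 3), (1, 4)]
    print(" ".join(code_mix(english, hindi, links, switch_prob=0.5)))
```

Pairing each English source sentence with the mixed output would give one synthetic English-to-code-mixed training example. No part-of-speech tags or language-identification labels are consulted in this sketch, in keeping with the claim that no linguistic resources are used to mark tokens for switching.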