Abstract

This paper describes the system submitted by the IITP-MT team to the Computational Approaches to Linguistic Code-Switching (CALCS 2021) shared task on MT for English→Hinglish. We submit a neural machine translation (NMT) system trained on a synthetic code-mixed (cm) English-Hinglish parallel corpus. We propose an approach to create a code-mixed parallel corpus from a clean parallel corpus in an unsupervised manner. The approach is alignment-based, and we do not use any linguistic resources to explicitly mark tokens for code-switching. We also train an NMT model on the gold corpus provided by the workshop organizers, augmented with the generated synthetic code-mixed parallel corpus. The model trained on the generated synthetic cm data achieves 10.09 BLEU points on the given test set.

Highlights

  • We introduce an alignment-based unsupervised approach for generating code-mixed data from a parallel corpus, which can be used to train a neural machine translation (NMT) model for code-mixed text translation.

  • We describe our submission to the shared task on Machine Translation (MT) for English → Hinglish at CALCS 2021.

  • We submit an NMT system trained on the parallel code-mixed English-Hinglish synthetic corpus.

Summary

Introduction

We introduce an alignment-based unsupervised approach for generating code-mixed data from a parallel corpus. We submit an NMT system trained on the parallel code-mixed English-Hinglish synthetic corpus. In Section 3, we describe our approach to generating the synthetic code-mixed corpus.

Several earlier works address the generation of synthetic code-mixed corpora. Gupta et al. (2020) proposed an encoder-decoder model that takes an English sentence along with linguistic features as input and generates a synthetic code-mixed sentence. Pratapa et al. (2018) explored the ‘Equivalence Constraint’ theory to generate synthetic code-mixed data, which is used to improve the performance of a Recurrent Neural Network (RNN) based language model. Winata et al. (2019) proposed a method to generate code-mixed data using a pointer-generator network, while Garg et al. (2018) explored SeqGAN for code-mixed data generation.
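To make the alignment-based idea above more concrete, the following is a minimal sketch of how word alignments could be used to turn a clean parallel sentence pair into a synthetic code-mixed target. It is not the authors' implementation. The function name `code_mix`, the `switch_prob` parameter, the restriction to target tokens aligned to exactly one source token, and the example sentence pair are all assumptions made for illustration. The alignment pairs are taken as given, as if produced by some word aligner.

```python
import random


def code_mix(src_tokens, tgt_tokens, alignment, switch_prob=0.5, seed=0):
    """Replace aligned target tokens with their source-language counterparts.

    alignment: list of (src_index, tgt_index) pairs from a word aligner.
    Assumption of this sketch: only target tokens aligned to exactly one
    source token are candidates for switching.
    """
    rng = random.Random(seed)

    # Group source indices by the target index they align to.
    by_target = {}
    for s, t in alignment:
        by_target.setdefault(t, []).append(s)

    mixed = list(tgt_tokens)
    for t, sources in by_target.items():
        # Switch a candidate token with probability switch_prob.
        if len(sources) == 1 and rng.random() < switch_prob:
            mixed[t] = src_tokens[sources[0]]
    return mixed


# Illustrative (hypothetical) English / romanized Hindi pair with alignments.
src = ["I", "like", "this", "movie"]
tgt = ["mujhe", "yah", "film", "pasand", "hai"]
align = [(0, 0), (2, 1), (3, 2), (1, 3)]

print(" ".join(code_mix(src, tgt, align, switch_prob=0.5)))
```

Setting `switch_prob` to 0.0 leaves the target sentence unchanged, and 1.0 switches every one-to-one aligned token. Which tokens are actually switched in the submitted system is described in the System Description section of the paper, not here.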

System Description
Romanization of the Hindi text
Experimental Setup
Results
