IITP-MT at CALCS2021: English to Hinglish Neural Machine Translation using Unsupervised Synthetic Code-Mixed Parallel Corpus

Ramakrishna Appicharla, Asif Ekbal, Kamal Kumar Gupta, Pushpak Bhattacharyya

doi:10.18653/v1/2021.calcs-1.5

Abstract

This paper describes the system submitted by the IITP-MT team to the Computational Approaches to Linguistic Code-Switching (CALCS 2021) shared task on MT for English→Hinglish. We submit a neural machine translation (NMT) system trained on a synthetic code-mixed (cm) English-Hinglish parallel corpus. We propose an approach to create a code-mixed parallel corpus from a clean parallel corpus in an unsupervised manner. It is an alignment-based approach, and we do not use any linguistic resources for explicitly marking any token for code-switching. We also train an NMT model on the gold corpus provided by the workshop organizers, augmented with the generated synthetic code-mixed parallel corpus. The model trained on the generated synthetic cm data achieves 10.09 BLEU points on the given test set.
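The alignment-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `parse_alignment` and `code_mix`, the `switch_ratio` parameter, and the rule of switching only one-to-one aligned tokens are all assumptions made for illustration. The sketch assumes word alignments are available as Pharaoh-style `i-j` pairs (source index, target index), a format that common word aligners emit, and it uses romanized Hindi tokens in the example for readability.

```python
import random
from collections import Counter
from typing import List, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse Pharaoh-style 'i-j' pairs: English index i, Hindi index j."""
    pairs = []
    for pair in line.split():
        i, j = pair.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def code_mix(
    english: List[str],
    hindi: List[str],
    alignment: List[Tuple[int, int]],
    switch_ratio: float = 0.5,  # hypothetical knob, not from the paper
    seed: int = 0,
) -> List[str]:
    """Swap some Hindi tokens for their aligned English tokens.

    Assumption: only one-to-one aligned tokens are candidates, so no
    linguistic resource is needed to decide what may be switched.
    """
    rng = random.Random(seed)
    src_counts = Counter(i for i, _ in alignment)
    tgt_counts = Counter(j for _, j in alignment)
    mixed = list(hindi)  # keep Hindi word order; never mutate the input
    for i, j in alignment:
        one_to_one = src_counts[i] == 1 and tgt_counts[j] == 1
        if one_to_one and rng.random() < switch_ratio:
            mixed[j] = english[i]
    return mixed


if __name__ == "__main__":
    en = "I am going to the market".split()
    hi = "main bazaar ja raha hoon".split()
    links = parse_alignment("0-0 5-1 2-2 2-3 1-4")
    print(" ".join(code_mix(en, hi, links, switch_ratio=0.5)))
```

Pairing each English sentence with the mixed output would yield a synthetic English to code-mixed training pair. Restricting switches to one-to-one links is a conservative choice that avoids breaking multi-word Hindi verb groups, at the cost of fewer switch points.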

Highlights

We introduce an alignment-based unsupervised approach for generating code-mixed data from a parallel corpus, which can be used to train the NMT model for code-mixed text translation.
In this paper, we describe our submission to the shared task on Machine Translation (MT) for English → Hinglish at CALCS 2021.
We submit an NMT system which is trained on the parallel code-mixed English-Hinglish synthetic corpus.
In section 3, we describe our approach to generate the synthetic code-mixed corpus along with the sys…

Summary

Introduction

We introduce an alignment-based unsupervised approach for generating code-mixed data from a parallel corpus. In this task, we submit an NMT system which is trained on the parallel code-mixed English-Hinglish synthetic corpus. In section 3, we describe our approach to generate the synthetic code-mixed corpus.

Gupta et al. (2020) proposed an encoder-decoder based model which takes an English sentence along with linguistic features as input and generates a synthetic code-mixed sentence. Pratapa et al. (2018) explored the 'Equivalence Constraint' theory to generate synthetic code-mixed data, which is used to improve the performance of a Recurrent Neural Network (RNN) based language model. Winata et al. (2019) proposed a method to generate code-mixed data using a pointer-generator network, while Garg et al. (2018) explored SeqGAN for code-mixed data generation.

System Description

Romanization of the Hindi text

Experimental Setup

Results


Publication Date: Jan 1, 2021
Citations: 3
License type: cc-by
