Abstract

Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains substantial noise, such as weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points on tst2016 and tst2017, respectively.
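The filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the use of simple averaged word vectors as sentence representations, and the similarity threshold are all assumptions. The key idea it demonstrates is scoring each synthetic (source, target) pair by cosine similarity in a shared bilingual embedding space and keeping only pairs above a threshold.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Average the word vectors of a sentence; zero vector if no token is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def similarity(src_tokens, tgt_tokens, src_emb, tgt_emb, dim):
    """Cosine similarity of source and target sentence vectors, assuming the
    two embedding tables live in one shared bilingual space."""
    s = sentence_vector(src_tokens, src_emb, dim)
    t = sentence_vector(tgt_tokens, tgt_emb, dim)
    denom = np.linalg.norm(s) * np.linalg.norm(t)
    return float(s @ t / denom) if denom else 0.0

def filter_pairs(pairs, src_emb, tgt_emb, threshold, dim):
    """Keep only synthetic pairs whose semantic similarity meets the threshold."""
    return [(s, t) for s, t in pairs
            if similarity(s, t, src_emb, tgt_emb, dim) >= threshold]
```

In practice the sentence representation and the cutoff would be tuned on held-out data; the sketch only shows the score-then-select structure of the method.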

Highlights

  • Recent advances in neural machine translation (NMT) have achieved human parity on several language pairs given large-scale parallel corpora [1,2]

  • For many language pairs, the amount of parallel corpora is limited; this is a major challenge in building high-performance machine translation (MT) systems [3]

  • The synthetic parallel data are constructed by translating target-language monolingual data into the source language with a backward (target-to-source) translation model trained on a given parallel corpus


Summary

Introduction

Recent advances in neural machine translation (NMT) have achieved human parity on several language pairs given large-scale parallel corpora [1,2]. Sennrich et al. [6] proposed a back-translation approach to expand a parallel training corpus with synthetic parallel data. In this approach, the synthetic parallel data are constructed by translating target-language monolingual data into the source language with a backward translation (target-to-source) model trained on a given parallel corpus. Although this approach can generate a large amount of synthetic parallel data, there is no guarantee of its quality.
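The back-translation construction described above can be sketched in a few lines. This is an illustrative outline under assumed names: `backward_model` stands in for any target-to-source translation model, and the sketch only captures how each monolingual target sentence is paired with its machine-generated source side.

```python
def back_translate(monolingual_tgt, backward_model):
    """Build synthetic (source, target) pairs by translating target-language
    monolingual sentences into the source language with a target-to-source model."""
    synthetic = []
    for tgt_sentence in monolingual_tgt:
        src_sentence = backward_model(tgt_sentence)  # target -> source translation
        synthetic.append((src_sentence, tgt_sentence))
    return synthetic
```

The resulting pairs are then mixed with the genuine parallel corpus for training; because the source sides are machine translations, some pairs may be noisy, which motivates the filtering proposed in this paper.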

