Abstract

Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 Bleu points for tst2016 and tst2017, respectively.

Highlights

  • Recent advances in neural machine translation (NMT) have achieved human parity on several language pairs given large-scale parallel corpora [1,2]

  • For many language pairs, the amount of parallel corpora is limited; this is a major challenge in building high-performance machine translation (MT) systems [3]

  • The synthetic parallel data are constructed by translating the target-language monolingual data into the source language with a backward translation model trained by a given parallel training corpus

Read more

Summary

Introduction

Recent advances in neural machine translation (NMT) have achieved human parity on several language pairs given large-scale parallel corpora [1,2]. Sennrich et al [6] proposed a back-translation approach to expand a parallel training corpus with synthetic parallel data. In this approach, the synthetic parallel data are constructed by translating the target-language monolingual data into the source language with a backward translation (target-to-source) model trained by a given parallel training corpus. The synthetic parallel data are constructed by translating the target-language monolingual data into the source language with a backward translation (target-to-source) model trained by a given parallel training corpus This approach can generate a large amount of synthetic parallel data, there is no guarantee of its quality

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.