Abstract
Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points for tst2016 and tst2017, respectively.
Highlights
Recent advances in neural machine translation (NMT) have achieved human parity on several language pairs given large-scale parallel corpora [1,2]
For many language pairs, the amount of parallel corpora is limited; this is a major challenge in building high-performance machine translation (MT) systems [3]
The synthetic parallel data are constructed by translating the target-language monolingual data into the source language with a backward translation model trained by a given parallel training corpus
Summary
Recent advances in neural machine translation (NMT) have achieved human parity on several language pairs given large-scale parallel corpora [1,2]. Sennrich et al. [6] proposed a back-translation approach to expand a parallel training corpus with synthetic parallel data. In this approach, the synthetic parallel data are constructed by translating the target-language monolingual data into the source language with a backward translation (target-to-source) model trained on a given parallel training corpus. While this approach can generate a large amount of synthetic parallel data, there is no guarantee of its quality.
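The filtering idea described above can be sketched in a few lines: embed both sides of each synthetic pair into a shared bilingual embedding space, score the pair by cosine similarity of the averaged word vectors, and keep only pairs above a threshold. The following is a minimal illustrative sketch, not the paper's implementation; the toy embeddings, the averaging scheme, and the threshold value are all assumptions (in practice the bilingual word embeddings would come from a method such as MUSE or VecMap, trained on real data).

```python
import numpy as np

# Toy bilingual word embeddings: source (Korean) and target (English)
# words mapped into the SAME vector space. These 2-d vectors are
# purely illustrative; real bilingual embeddings would be learned.
SRC_EMB = {
    "고양이": np.array([1.0, 0.0]),  # "cat"
    "개": np.array([0.0, 1.0]),      # "dog"
}
TGT_EMB = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.1, 0.9]),
}

def sentence_vector(tokens, emb, dim=2):
    """Average the embeddings of in-vocabulary tokens."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def similarity(src_tokens, tgt_tokens):
    """Cosine similarity between the averaged sentence vectors."""
    s = sentence_vector(src_tokens, SRC_EMB)
    t = sentence_vector(tgt_tokens, TGT_EMB)
    denom = np.linalg.norm(s) * np.linalg.norm(t)
    return float(s @ t / denom) if denom else 0.0

def filter_pairs(pairs, threshold=0.5):
    """Keep only synthetic pairs whose similarity clears the threshold."""
    return [(s, t) for s, t in pairs if similarity(s, t) >= threshold]

pairs = [
    (["고양이"], ["cat"]),  # consistent pair: kept
    (["고양이"], ["dog"]),  # mistranslated (noisy) pair: discarded
]
print(filter_pairs(pairs))  # only the ("고양이", "cat") pair survives
```

The key design choice is that scoring is model-free at filtering time: once the bilingual embeddings exist, each synthetic pair is scored independently, so the filter scales linearly with corpus size and can be run before NMT training.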