Abstract
While synthetic bilingual corpora have demonstrated their effectiveness in low-resource neural machine translation (NMT), adding more synthetic data often deteriorates translation performance. In this work, we propose alternated training with synthetic and authentic data for NMT. The basic idea is to alternate synthetic and authentic corpora iteratively during training. Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data. Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines. We visualize the BLEU landscape to further investigate the role of authentic and synthetic data during alternated training. From the visualization, we find that authentic data helps to direct the NMT model parameters towards points with higher BLEU scores and leads to consistent translation performance improvement.
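The alternation idea described above can be sketched as a simple training schedule. The helper names below (`alternation_schedule`, `train_alternated`, `model_update`) are hypothetical illustrations, not the authors' implementation; a real NMT setup would plug in an actual optimizer step in place of the callback.

```python
from itertools import cycle, islice

def alternation_schedule(num_phases, pattern=("synthetic", "authentic")):
    """Return which corpus to train on in each phase.

    Repeats the given pattern, a simplified stand-in for iteratively
    alternating synthetic and authentic corpora during training.
    """
    return list(islice(cycle(pattern), num_phases))

def train_alternated(model_update, synthetic_batches, authentic_batches, num_phases):
    """Alternate full passes over synthetic and authentic data.

    `model_update` is a hypothetical callback that applies one
    training step to a batch; the corpus label lets the caller
    adjust behaviour (e.g. learning rate) per data source.
    """
    sources = {"synthetic": synthetic_batches, "authentic": authentic_batches}
    for corpus in alternation_schedule(num_phases):
        for batch in sources[corpus]:
            model_update(batch, corpus)
```

For example, `alternation_schedule(4)` yields `["synthetic", "authentic", "synthetic", "authentic"]`, so every synthetic phase is followed by an authentic phase that re-anchors the model on clean data.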
Highlights
While recent years have witnessed the rapid development of Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), it heavily relies on large-scale, high-quality bilingual corpora
We propose alternated training with synthetic and authentic data for neural machine translation
We introduce authentic data as guidance to prevent the training of neural machine translation (NMT) models from being disturbed by noisy synthetic data
Summary
While recent years have witnessed the rapid development of Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), it heavily relies on large-scale, high-quality bilingual corpora. One direction to alleviate the problem is to add noise or a special tag on the source side of synthetic data, which enables NMT models to distinguish between authentic and synthetic data (Edunov et al., 2018; Caswell et al., 2019). Another direction is to filter or evaluate the synthetic data by calculating confidence over corpora, making NMT models better exploit synthetic data (Imamura et al., 2018; Wang et al., 2019). Experiments on Chinese-English translation tasks show that our approach improves the performance over strong baselines.
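The tagging direction mentioned above (Caswell et al., 2019) amounts to prepending a reserved token to synthetic source sentences so the model can tell them apart from authentic ones. A minimal sketch, assuming a hypothetical `<BT>` tag token reserved in the vocabulary:

```python
def tag_synthetic(source_sentence, tag="<BT>"):
    """Prepend a reserved tag token to a back-translated source
    sentence, letting the NMT model distinguish synthetic from
    authentic training data at the input level."""
    return f"{tag} {source_sentence}"
```

Authentic sentences are left untagged, so e.g. `tag_synthetic("wie geht es dir")` produces `"<BT> wie geht es dir"` while the authentic side of the corpus is fed to the model unchanged.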