Abstract

While synthetic bilingual corpora have demonstrated their effectiveness in low-resource neural machine translation (NMT), adding more synthetic data often deteriorates translation performance. In this work, we propose alternated training with synthetic and authentic data for NMT. The basic idea is to alternate synthetic and authentic corpora iteratively during training. Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data. Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines. We visualize the BLEU landscape to further investigate the role of authentic and synthetic data during alternated training. From the visualization, we find that authentic data helps to direct the NMT model parameters towards points with higher BLEU scores and leads to consistent translation performance improvement.
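The alternation scheme described in the abstract can be sketched as a simple training loop. The following is a hypothetical illustration only: `train_step`, the batch lists, and the `source` flag are placeholders, not the authors' implementation.

```python
# Hypothetical sketch of alternated training for NMT: in each round the
# model sees all synthetic (back-translated) batches, then all authentic
# batches, so the clean data repeatedly re-anchors the parameters.
def alternated_training(model, authentic_batches, synthetic_batches,
                        num_rounds, train_step):
    for _ in range(num_rounds):
        for batch in synthetic_batches:
            # noisy, machine-generated parallel data
            train_step(model, batch, source="synthetic")
        for batch in authentic_batches:
            # clean, human-translated parallel data acts as guidance
            train_step(model, batch, source="authentic")
    return model
```

The key design choice, as the abstract describes it, is that authentic data is revisited in every round rather than merely mixed in once, which is what lets it steer the parameters toward regions with higher BLEU.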

Highlights

  • While recent years have witnessed the rapid development of Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), it heavily relies on large-scale, high-quality bilingual corpora.

  • We propose alternated training with synthetic and authentic data for neural machine translation

  • We introduce authentic data as guidance to prevent the training of neural machine translation (NMT) models from being disturbed by noisy synthetic data


Summary

Introduction

While recent years have witnessed the rapid development of Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), it heavily relies on large-scale, high-quality bilingual corpora. One direction to alleviate this problem is to add noise or a special tag on the source side of synthetic data, which enables NMT models to distinguish between authentic and synthetic data (Edunov et al., 2018; Caswell et al., 2019). Another direction is to filter or evaluate the synthetic data by computing confidence scores over corpora, helping NMT models better exploit synthetic data (Imamura et al., 2018; Wang et al., 2019). In contrast, we alternate synthetic and authentic corpora during training so that authentic data guides the model away from the noise in synthetic data. Experiments on Chinese-English translation tasks show that our approach improves performance over strong baselines.
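The tagging direction mentioned above can be illustrated with a minimal sketch in the spirit of tagged back-translation (Caswell et al., 2019). The `<BT>` token and the function name are illustrative assumptions, not the cited implementation.

```python
# Minimal sketch of tagged back-translation: a reserved token is
# prepended to every synthetic source sentence so the model can tell
# back-translated data apart from authentic data during training.
# The "<BT>" token is an illustrative placeholder.
def tag_synthetic(source_sentences, tag="<BT>"):
    return [f"{tag} {sent}" for sent in source_sentences]
```

In practice the tag would be added to the model's vocabulary as a single reserved token so it is not split by subword segmentation.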

Alternated Training
Experiments
Results
BLEU Landscape Visualization
Related Work
Conclusion
A Method for Visualization

