Abstract

The scale of the parallel corpus plays an important role in training high-quality neural machine translation (NMT) models. To expand the parallel corpus in low-resource scenarios, researchers have proposed a series of data augmentation approaches, the most representative of which is back-translation. Back-translation uses a base translation model to translate a target-side monolingual corpus into the source language, and then combines the resulting pairs into a pseudo-parallel corpus that expands the training data. Owing to this simple and efficient expansion strategy, back-translation has become the mainstream data augmentation approach for neural machine translation. However, in low-resource scenarios only low-accuracy base translation models are available, and the pseudo-parallel corpus they produce contains low-quality sentence pairs, which inevitably introduces noise and degrades translation performance. To improve the performance of low-resource neural machine translation, we propose a filtered data augmentation method based on model competence evaluation, which improves the quality of the training data by combining filtered back-translation with a dynamic evaluation of the translation model's capability. Comparative experiments show that the proposed approach improves both the quality of the training data and the performance of low-resource neural machine translation.
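To make the idea concrete, the sketch below illustrates filtered back-translation in Python: target-side monolingual sentences are translated back into the source language, and only pseudo pairs that pass a quality check are kept. The `translate` and `score` functions, the length-ratio check, and the threshold value are all illustrative assumptions, not the paper's exact competence-evaluation criterion.

```python
# Minimal sketch of filtered back-translation. Assumes a trained
# target->source model exposing translate(), and a sentence-level
# score() (e.g., a length-normalized log-probability from the
# forward model). Both callables and the thresholds below are
# hypothetical placeholders for the paper's filtering criterion.
from typing import Callable, List, Tuple

def filtered_back_translation(
    target_mono: List[str],
    translate: Callable[[str], str],      # target -> source (hypothetical)
    score: Callable[[str, str], float],   # pair quality score (hypothetical)
    threshold: float = -1.5,
    max_len_ratio: float = 2.0,
) -> List[Tuple[str, str]]:
    """Back-translate target monolingual sentences and keep only the
    pseudo (source, target) pairs whose quality score clears a threshold."""
    pseudo_pairs: List[Tuple[str, str]] = []
    for tgt in target_mono:
        src = translate(tgt)  # synthetic source sentence
        # Cheap sanity filter: discard pairs with extreme length mismatch.
        ratio = max(len(src.split()), 1) / max(len(tgt.split()), 1)
        if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
            continue
        # Keep only pairs the scoring model judges sufficiently adequate.
        if score(src, tgt) >= threshold:
            pseudo_pairs.append((src, tgt))
    return pseudo_pairs
```

The filtered pairs would then be mixed with the genuine parallel data for training; in the paper's dynamic variant, the filtering criterion is re-evaluated as the model's competence grows rather than fixed in advance.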
