Abstract

In this paper, we address data reduction for neural machine translation (NMT): selecting a subset of a very large training corpus and training NMT on this subset, so as to reduce training time while achieving the same or even higher translation quality. We propose two effective approaches to this goal: a static sentence selection method that selects sentences into a subset before training according to their sentence embeddings, and a dynamic sentence selection method that selects sentences for each epoch during training based on their training costs. We examine the effect on NMT of an n-gram-based data reduction method originally proposed for statistical machine translation (SMT) and compare the two proposed approaches against this traditional method. Experiments on the United Nations Parallel Corpus show that the better of the two proposed approaches can halve training time while improving translation quality by up to +0.79 BLEU.
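The abstract only names the two selection criteria, so the following is a minimal, hypothetical sketch of what static and dynamic sentence selection could look like in Python. It assumes precomputed sentence embeddings and per-sentence training losses; the centroid-distance score, the function names, and the top-k policy are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def static_selection(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Pick k sentence indices once, before training.

    Illustrative criterion (an assumption, not the paper's): score each
    sentence by its embedding's distance from the corpus centroid, on the
    idea that distinctive sentences add the most information.
    """
    centroid = embeddings.mean(axis=0)
    scores = np.linalg.norm(embeddings - centroid, axis=1)
    return np.argsort(scores)[-k:]  # indices of the k highest-scoring sentences

def dynamic_selection(losses: np.ndarray, k: int) -> np.ndarray:
    """Pick k sentence indices for the upcoming epoch.

    Recomputed every epoch from the current model's per-sentence training
    losses, so the subset tracks what the model currently finds hard.
    """
    return np.argsort(losses)[-k:]  # indices of the k highest-loss sentences
```

Under this sketch, the static subset is fixed before training begins, whereas the dynamic subset would be rebuilt at the start of each epoch by scoring the full corpus with the current model and training only on the selected sentences.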
