Abstract

In this paper, we address data reduction for neural machine translation (NMT): selecting a subset of a very large training corpus and training NMT on this subset, so as to reduce training time while achieving the same or even higher translation quality. We propose two effective approaches to this goal: a static sentence selection method that selects sentences into a subset before training according to their sentence embeddings, and a dynamic sentence selection method that selects sentences for each epoch during training based on their training costs. We examine the effect on NMT of an n-gram-based data reduction method originally proposed for statistical machine translation (SMT) and compare the two proposed approaches against this traditional method. Experiments on the United Nations Parallel Corpus show that the better of the two proposed approaches can halve training time while improving translation quality by up to +0.79 BLEU.
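The abstract only names the two selection criteria, so the following is a minimal, hypothetical sketch of what static and dynamic sentence selection could look like in Python. It assumes precomputed sentence embeddings and per-sentence training losses; the centroid-distance score, the function names, and the top-k policy are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def static_selection(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Pick k sentence indices once, before training.

    Illustrative criterion (an assumption, not the paper's): score each
    sentence by its embedding's distance from the corpus centroid, on the
    idea that distinctive sentences add the most information.
    """
    centroid = embeddings.mean(axis=0)
    scores = np.linalg.norm(embeddings - centroid, axis=1)
    return np.argsort(scores)[-k:]  # indices of the k highest-scoring sentences

def dynamic_selection(losses: np.ndarray, k: int) -> np.ndarray:
    """Pick k sentence indices for the upcoming epoch.

    Recomputed every epoch from the current model's per-sentence training
    losses, so the subset tracks what the model currently finds hard.
    """
    return np.argsort(losses)[-k:]  # indices of the k highest-loss sentences
```

Under this sketch, the static subset is fixed before training begins, whereas the dynamic subset would be rebuilt at the start of each epoch by scoring the full corpus with the current model and training only on the selected sentences.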
