Neural machine translation (NMT) has become prominent in many machine translation tasks. However, in some domain-specific tasks, only corpora from similar domains can improve translation performance; if out-of-domain corpora are added directly to the in-domain corpus, translation performance may even degrade. Domain adaptation techniques are therefore essential for NMT. Most existing domain adaptation methods were designed for conventional phrase-based machine translation, and only a few studies address NMT domain adaptation, on topics such as fine-tuning, domain tags, and domain features. In this paper, we pursue four goals for sentence-level NMT domain adaptation. First, we exploit NMT's internal sentence embeddings and use sentence embedding similarity to select out-of-domain sentences that are close to the in-domain corpus. Second, we propose three weighting methods, namely sentence weighting, domain weighting, and batch weighting, to balance the data distribution during NMT training. Third, we propose dynamic training methods that adjust sentence selection and weighting during NMT training. Fourth, to address the multidomain problem in real-world NMT scenarios, where the domain distributions of training and test data often mismatch, we propose a multidomain sentence weighting method that balances the domain distributions of the training data and matches them to those of the test data. The proposed methods are evaluated on the International Workshop on Spoken Language Translation (IWSLT) English-to-French and English-to-German tasks and on a multidomain English-to-French task. Empirical results show that the sentence selection and weighting methods significantly improve NMT performance and outperform the existing baselines.
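To make the first two ideas concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes that sentence embeddings are already available as plain lists of floats, that similarity is measured as cosine similarity to the in-domain centroid, and that the weighted loss is normalized by the sum of the weights; all three are illustrative assumptions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_by_similarity(in_domain_emb, out_domain_emb, top_k):
    """Rank out-of-domain sentences by similarity to the in-domain centroid.

    Assumption: the centroid-based cosine score is a stand-in for whatever
    similarity measure the full paper defines.
    Returns the indices of the top_k closest sentences and all scores.
    """
    c = centroid(in_domain_emb)
    scores = [cosine(e, c) for e in out_domain_emb]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores


def weighted_nll(token_log_probs, sentence_weights):
    """Sentence-weighted negative log-likelihood over a batch.

    token_log_probs: one list per sentence holding log p of each reference token.
    sentence_weights: one non-negative weight per sentence (assumed given).
    Assumption: the loss is normalized by the total weight.
    """
    per_sentence = [-sum(lp) for lp in token_log_probs]
    total_weight = sum(sentence_weights)
    return sum(w * l for w, l in zip(sentence_weights, per_sentence)) / total_weight
```

In this sketch, selection keeps only the highest-scoring out-of-domain sentences, whereas weighting keeps every sentence but scales its contribution to the training loss, so the two can be combined or applied separately.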