Abstract
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness this segmentation ambiguity as noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements, especially in low-resource and out-of-domain settings.
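The core idea of subword regularization is that a word admits several in-vocabulary segmentations, and training samples one of them probabilistically at each step rather than always using the single best one. The following is a minimal, self-contained sketch of that sampling step, using a tiny hypothetical unigram vocabulary (the subword probabilities below are illustrative values, not taken from the paper):

```python
import random

# Hypothetical unigram probabilities over subword pieces (illustrative only).
unigram_p = {"Hell": 0.05, "o": 0.10, "H": 0.08, "ello": 0.04,
             "He": 0.06, "llo": 0.03, "Hello": 0.15}

def segmentations(word):
    """Enumerate every way to split `word` into in-vocabulary subwords."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in unigram_p:
            for rest in segmentations(word[i:]):
                out.append([piece] + rest)
    return out

def seg_prob(seg):
    """Unigram LM: a segmentation's probability is the product of its pieces'."""
    p = 1.0
    for piece in seg:
        p *= unigram_p[piece]
    return p

def sample_segmentation(word, rng=random):
    """Sample one segmentation, weighted by its unigram probability."""
    segs = segmentations(word)
    weights = [seg_prob(s) for s in segs]
    return rng.choices(segs, weights=weights, k=1)[0]
```

With this vocabulary, `sample_segmentation("Hello")` may return `["Hello"]`, `["Hell", "o"]`, `["H", "ello"]`, or `["He", "llo"]`, so the NMT model sees varied segmentations of the same word across training steps. (The paper samples from the n-best segmentations with a temperature parameter; exhaustive enumeration here is only for clarity on short words.)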
Highlights
Neural Machine Translation (NMT) models (Bahdanau et al., 2014; Luong et al., 2015; Wu et al., 2016; Vaswani et al., 2017) often operate with fixed word vocabularies, as their training and inference depend heavily on the vocabulary size.
We propose a new subword segmentation algorithm based on a unigram language model, which is capable of outputting multiple subword segmentations with probabilities
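Under a unigram language model, the most probable segmentation can be found with standard dynamic programming (Viterbi decoding over split points). Below is a hedged sketch of that decoding step; the log-probabilities are hypothetical illustrative values, not the paper's learned parameters:

```python
import math

# Hypothetical unigram log-probabilities (illustrative values only).
log_p = {"Hell": math.log(0.05), "o": math.log(0.10), "H": math.log(0.08),
         "ello": math.log(0.04), "Hello": math.log(0.15)}

def viterbi_segment(word):
    """Most probable segmentation under the unigram LM.

    best[i] = (log-prob of the best segmentation of word[:i], backpointer).
    """
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_p:
                score = best[start][0] + log_p[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow backpointers to recover the pieces in order.
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]
```

Here `viterbi_segment("Hello")` prefers the single piece `["Hello"]`, since log(0.15) beats the summed log-probabilities of any multi-piece split such as `["Hell", "o"]`.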
Summary
Neural Machine Translation (NMT) models (Bahdanau et al., 2014; Luong et al., 2015; Wu et al., 2016; Vaswani et al., 2017) often operate with fixed word vocabularies, as their training and inference depend heavily on the vocabulary size.

    Subword sequence      Vocabulary id sequence
    Hell/o/ world         13586 137 255
    H/ello/ world         320 7363 255
    He/llo/ world         579 10115 255
    /He/l/l/o/ world      7 18085 356 356 137 255
    H/el/l/o/ /world      320 585 356 137 7 12295

Table 1: Multiple subword sequences encoding the same sentence "Hello World". While these sequences encode the same input "Hello World", NMT handles them as completely different inputs. We propose a new regularization method for open-vocabulary NMT, called subword regularization, which employs multiple subword segmentations to make the NMT model accurate and robust. Empirical experiments using multiple corpora with different sizes and languages show that subword regularization achieves significant improvements over the method using a single subword sequence. Through experiments with out-of-domain corpora, we show that subword regularization improves the robustness of the NMT model.
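Table 1's point can be made concrete by mapping subword pieces to their ids: two segmentations of the same sentence share no ids in the segmented region, so the NMT model receives genuinely different input sequences. The sketch below uses the ids from the first two rows of Table 1; the `_world` key is an assumption, using `_` to stand for the word-initial space in " world":

```python
# Subword-to-id mapping taken from Table 1; "_" is assumed to mark the
# word-initial space in " world".
vocab = {"Hell": 13586, "o": 137, "_world": 255,
         "H": 320, "ello": 7363, "He": 579, "llo": 10115}

def encode(pieces):
    """Map a subword segmentation to its vocabulary id sequence."""
    return [vocab[p] for p in pieces]

# The same sentence "Hello world", segmented two different ways:
a = encode(["Hell", "o", "_world"])   # [13586, 137, 255]
b = encode(["H", "ello", "_world"])   # [320, 7363, 255]
# Only the shared "_world" id overlaps; the model sees distinct inputs.
```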