Abstract

Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements, especially in low-resource and out-of-domain settings.
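
In practice, this kind of probabilistic segmentation can be exercised with a unigram subword model. The sketch below is a minimal illustration, assuming the SentencePiece library (not named on this page) and an illustrative corpus file corpus.txt; the vocabulary size and the sampling parameters alpha and nbest_size are arbitrary example values, not settings from the paper.

    import sentencepiece as spm

    # Train a unigram language model over subwords
    # (illustrative file names and vocabulary size).
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="unigram", vocab_size=8000,
        model_type="unigram")

    sp = spm.SentencePieceProcessor(model_file="unigram.model")

    # With sampling enabled, each call may return a different segmentation
    # of the same sentence: alpha controls the sharpness of the sampling
    # distribution, and nbest_size=-1 samples from the full lattice.
    for epoch in range(3):
        pieces = sp.encode("Hello World", out_type=str,
                           enable_sampling=True, alpha=0.1, nbest_size=-1)
        print(epoch, pieces)

Resampling the segmentation of each sentence in this way during training corresponds to the on-the-fly subword sampling discussed in the section outline below.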

Highlights

  • Neural Machine Translation (NMT) models (Bahdanau et al., 2014; Luong et al., 2015; Wu et al., 2016; Vaswani et al., 2017) often operate with fixed word vocabularies, as their training and inference depend heavily on the vocabulary size.

  • We propose a new subword segmentation algorithm based on a unigram language model, which is capable of outputting multiple subword segmentations with probabilities.

Summary

Introduction

Neural Machine Translation (NMT) models (Bahdanau et al., 2014; Luong et al., 2015; Wu et al., 2016; Vaswani et al., 2017) often operate with fixed word vocabularies, as their training and inference depend heavily on the vocabulary size.

Table 1: Multiple subword sequences encoding the same sentence “Hello World”.

  Subword sequence (/: subword boundary)    Vocabulary id sequence
  Hell/o/ world                             13586 137 255
  H/ello/ world                             320 7363 255
  He/llo/ world                             579 10115 255
  /He/l/l/o/ world                          7 18085 356 356 137 255
  H/el/l/o/ /world                          320 585 356 137 7 12295

While these sequences encode the same input “Hello World”, NMT handles them as completely different inputs. We propose a new regularization method for open-vocabulary NMT, called subword regularization, which employs multiple subword segmentations to make the NMT model accurate and robust. Empirical experiments using multiple corpora with different sizes and languages show that subword regularization achieves significant improvements over the method using a single subword sequence. Through experiments with out-of-domain corpora, we show that subword regularization improves the robustness of the NMT model.
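
The ambiguity illustrated in Table 1 can also be inspected directly by listing several high-probability segmentations of the same input. The sketch below is again a hypothetical illustration assuming SentencePiece with a trained unigram model; the model path is illustrative, and the resulting pieces and ids depend on the training data, so they will not reproduce the ids in Table 1.

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="unigram.model")  # illustrative path

    # List the n best segmentations of the same sentence, ranked by the
    # unigram language-model score, together with their vocabulary ids.
    for pieces in sp.nbest_encode_as_pieces("Hello World", 5):
        ids = [sp.piece_to_id(p) for p in pieces]
        print(" ".join(pieces), "->", ids)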

NMT training with on-the-fly subword sampling
Decoding
Subword segmentations with language model
Unigram language model
Subword sampling
Related Work
Setting
Main Results
Results with out-of-domain corpus
Comparison with other segmentation algorithms
Impact of sampling hyperparameters
Results with single side regularization
Conclusions