Abstract
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness this segmentation ambiguity as noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements, especially in low-resource and out-of-domain settings.
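The core idea of subword regularization is that a word admits several in-vocabulary segmentations, and training samples one of them probabilistically at each step rather than always using the single best one. The following is a minimal, self-contained sketch of that sampling step, using a tiny hypothetical unigram vocabulary (the subword probabilities below are illustrative values, not taken from the paper):

```python
import random

# Hypothetical unigram probabilities over subword pieces (illustrative only).
unigram_p = {"Hell": 0.05, "o": 0.10, "H": 0.08, "ello": 0.04,
             "He": 0.06, "llo": 0.03, "Hello": 0.15}

def segmentations(word):
    """Enumerate every way to split `word` into in-vocabulary subwords."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in unigram_p:
            for rest in segmentations(word[i:]):
                out.append([piece] + rest)
    return out

def seg_prob(seg):
    """Unigram LM: a segmentation's probability is the product of its pieces'."""
    p = 1.0
    for piece in seg:
        p *= unigram_p[piece]
    return p

def sample_segmentation(word, rng=random):
    """Sample one segmentation, weighted by its unigram probability."""
    segs = segmentations(word)
    weights = [seg_prob(s) for s in segs]
    return rng.choices(segs, weights=weights, k=1)[0]
```

With this vocabulary, `sample_segmentation("Hello")` may return `["Hello"]`, `["Hell", "o"]`, `["H", "ello"]`, or `["He", "llo"]`, so the NMT model sees varied segmentations of the same word across training steps. (The paper samples from the n-best segmentations with a temperature parameter; exhaustive enumeration here is only for clarity on short words.)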
Highlights
Neural Machine Translation (NMT) models (Bahdanau et al., 2014; Luong et al., 2015; Wu et al., 2016; Vaswani et al., 2017) often operate with fixed word vocabularies, as their training and inference depend heavily on the vocabulary size.
We propose a new subword segmentation algorithm based on a unigram language model, which is capable of outputting multiple subword segmentations with probabilities
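Under a unigram language model, the most probable segmentation can be found with standard dynamic programming (Viterbi decoding over split points). Below is a hedged sketch of that decoding step; the log-probabilities are hypothetical illustrative values, not the paper's learned parameters:

```python
import math

# Hypothetical unigram log-probabilities (illustrative values only).
log_p = {"Hell": math.log(0.05), "o": math.log(0.10), "H": math.log(0.08),
         "ello": math.log(0.04), "Hello": math.log(0.15)}

def viterbi_segment(word):
    """Most probable segmentation under the unigram LM.

    best[i] = (log-prob of the best segmentation of word[:i], backpointer).
    """
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_p:
                score = best[start][0] + log_p[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow backpointers to recover the pieces in order.
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]
```

Here `viterbi_segment("Hello")` prefers the single piece `["Hello"]`, since log(0.15) beats the summed log-probabilities of any multi-piece split such as `["Hell", "o"]`.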
Summary
Neural Machine Translation (NMT) models (Bahdanau et al., 2014; Luong et al., 2015; Wu et al., 2016; Vaswani et al., 2017) often operate with fixed word vocabularies, as their training and inference depend heavily on the vocabulary size.

    Subword sequence      Vocabulary id sequence
    Hell/o/ world         13586 137 255
    H/ello/ world         320 7363 255
    He/llo/ world         579 10115 255
    /He/l/l/o/ world      7 18085 356 356 137 255
    H/el/l/o/ /world      320 585 356 137 7 12295

Table 1: Multiple subword sequences encoding the same sentence "Hello World". While these sequences encode the same input "Hello World", NMT handles them as completely different inputs. We propose a new regularization method for open-vocabulary NMT, called subword regularization, which employs multiple subword segmentations to make the NMT model accurate and robust. Empirical experiments using multiple corpora with different sizes and languages show that subword regularization achieves significant improvements over the method using a single subword sequence. Through experiments with out-of-domain corpora, we show that subword regularization improves the robustness of the NMT model.
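Table 1's point can be made concrete by mapping subword pieces to their ids: two segmentations of the same sentence share no ids in the segmented region, so the NMT model receives genuinely different input sequences. The sketch below uses the ids from the first two rows of Table 1; the `_world` key is an assumption, using `_` to stand for the word-initial space in " world":

```python
# Subword-to-id mapping taken from Table 1; "_" is assumed to mark the
# word-initial space in " world".
vocab = {"Hell": 13586, "o": 137, "_world": 255,
         "H": 320, "ello": 7363, "He": 579, "llo": 10115}

def encode(pieces):
    """Map a subword segmentation to its vocabulary id sequence."""
    return [vocab[p] for p in pieces]

# The same sentence "Hello world", segmented two different ways:
a = encode(["Hell", "o", "_world"])   # [13586, 137, 255]
b = encode(["H", "ello", "_world"])   # [320, 7363, 255]
# Only the shared "_world" id overlaps; the model sees distinct inputs.
```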