Abstract

Long sentences have been one of the major challenges in neural machine translation (NMT). Although approaches such as the attention mechanism have partially remedied the problem, we found that the current standard NMT model, Transformer, has more difficulty translating long sentences than the former standard, the recurrent neural network (RNN)-based model. One of the key differences between these NMT models is how they handle position information, which is essential for processing sequential data. In this study, we focus on the type of position information used by NMT models and hypothesize that relative position is better than absolute position. To examine this hypothesis, we propose RNN-Transformer, which replaces the positional encoding layer of Transformer with an RNN, and then compare the RNN-based model and four variants of Transformer. Experiments on ASPEC English-to-Japanese and WMT2014 English-to-German translation tasks demonstrate that relative position helps translate sentences longer than those in the training data. Further experiments on length-controlled training data reveal that absolute position actually causes overfitting to the sentence length.
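To make the architectural change concrete, the sketch below replaces the absolute (sinusoidal) positional encoding of a standard Transformer encoder with an RNN layer, so that position is conveyed only through the recurrence, i.e. relatively. This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation; the class name RNNTransformerEncoder and all hyperparameters are illustrative.

```python
# Minimal sketch (illustration only, not the paper's code) of replacing the
# absolute positional encoding of a Transformer encoder with an RNN layer.
import torch
import torch.nn as nn


class RNNTransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # A GRU over the embeddings injects order information through its
        # recurrence (relative position); no sinusoidal encoding is added.
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, src_tokens, src_key_padding_mask=None):
        x = self.embed(src_tokens)      # (batch, seq_len, d_model)
        x, _ = self.rnn(x)              # position comes only from recurrence
        return self.encoder(x, src_key_padding_mask=src_key_padding_mask)


# Usage: because position is encoded only relatively, the encoder is never
# exposed to absolute position values, unlike a positional-encoding layer.
enc = RNNTransformerEncoder(vocab_size=32000)
tokens = torch.randint(0, 32000, (2, 75))   # batch of 2 sentences, length 75
out = enc(tokens)                            # shape: (2, 75, 512)
```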

Highlights

  • Sequence to sequence models for neural machine translation (NMT) are utilized for various text generation tasks, including automatic summarization (Chopra et al., 2016; Nallapati et al., 2016; Rush et al., 2015) and dialogue systems (Vinyals and Le, 2015; Shang et al., 2015); the models are therefore required to take inputs of various lengths

  • Koehn and Knowles (2017) report that even a recurrent neural network (RNN)-based model with the attention mechanism performs worse than phrase-based statistical machine translation (Koehn et al., 2007) in translating very long sentences, which challenges us to develop an NMT model that is robust to long sentences or, more generally, to variations in input length

  • Have recent advances in NMT achieved robustness to variations in input length? NMT has been advancing by upgrading the model architecture: RNN-based models (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015), followed by convolutional neural network (CNN)-based models (Kalchbrenner et al., 2016; Gehring et al., 2017) and the attention-based model called Transformer (Vaswani et al., 2017) (§ 2)


Summary

Introduction

Sequence to sequence models for neural machine translation (NMT) are utilized for various text generation tasks, including automatic summarization (Chopra et al., 2016; Nallapati et al., 2016; Rush et al., 2015) and dialogue systems (Vinyals and Le, 2015; Shang et al., 2015); the models are therefore required to take inputs of various lengths. Studies on recurrent neural network (RNN)-based models analyze translation quality with respect to sentence length and show that their models improve translations of long sentences by using long short-term memory (LSTM) units (Sutskever et al., 2014) or by introducing the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). Koehn and Knowles (2017) report that even an RNN-based model with the attention mechanism performs worse than phrase-based statistical machine translation (Koehn et al., 2007) in translating very long sentences, which challenges us to develop an NMT model that is robust to long sentences or, more generally, to variations in input length. This raises the question of whether Transformer has acquired robustness to variations in input length.

