Abstract
Many document-level neural machine translation (NMT) systems have explored the utility of context-aware architectures, usually at the cost of a growing number of parameters and increased computational complexity. However, little attention has been paid to the baseline model. In this paper, we extensively investigate the pros and cons of the standard transformer in document-level translation, and find that its auto-regressive property brings both the advantage of consistency and the disadvantage of error accumulation. We therefore propose a surprisingly simple long-short term masking self-attention on top of the standard transformer, which both captures long-range dependencies effectively and reduces the propagation of errors. We evaluate our approach on two publicly available document-level datasets, achieving strong BLEU results and capturing discourse phenomena.
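To make the core idea concrete, below is a minimal sketch of how a long-term and a short-term attention mask over a concatenated multi-sentence input might be constructed. This is an illustrative assumption, not the paper's exact formulation: we assume sentence-boundary information is available as a per-token sentence index, that the short-term mask restricts attention to the current sentence, and that the long-term mask permits attention over the full context.

```python
import numpy as np

def long_short_term_masks(sent_ids):
    """Build two boolean attention masks over a multi-sentence input.

    sent_ids: per-token sentence index for the concatenated sequence.
    Returns (short_mask, long_mask); True means "position may attend".
    Hypothetical sketch: the paper's actual masking scheme may differ.
    """
    ids = np.asarray(sent_ids)
    n = len(ids)
    # Short-term view: a token attends only within its own sentence.
    short_mask = ids[:, None] == ids[None, :]
    # Long-term view: a token may attend to the entire multi-sentence context.
    long_mask = np.ones((n, n), dtype=bool)
    return short_mask, long_mask

# Two sentences of lengths 3 and 2 concatenated: tokens 0-2 and 3-4.
short, long_ = long_short_term_masks([0, 0, 0, 1, 1])
```

In a transformer layer, such masks would be applied by setting the attention logits of disallowed positions to a large negative value before the softmax, so that heads using the short-term mask stay within the current sentence while heads (or layers) using the long-term mask see cross-sentence context.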
Highlights
Recent advances in deep learning have led to significant improvements in Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017).
The contributions of this paper are threefold: i) we extensively investigate the performance of the standard transformer in the setting of multi-sentence input and output; ii) we propose a simple but effective modification that adapts the transformer to document-level NMT with the aim of ameliorating the effect of error accumulation; iii) our experiments demonstrate that even this simple baseline can achieve comparable results.
When we apply the partial copy trick to our model, lexical cohesion improves by 27%, but at the cost of BLEU.
Summary
Recent advances in deep learning have led to significant improvements in Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017). Document-level NMT, a more realistic translation task, has been studied systematically. Most prior work has focused on looking back at a fixed number of previous source or target sentences as the document-level context (Tu et al., 2018; Voita et al., 2018; Zhang et al., 2018; Miculicich et al., 2018; Voita et al., 2019a,b). We likewise elect to attend only to the context in the previous n sentences, where n is a small number and usually does not cover the entire document.