Abstract

Advanced neural machine translation (NMT) models generally implement encoder and decoder as multiple layers, which allows systems to model complex functions and capture complicated linguistic structures. However, only the top layers of encoder and decoder are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to simultaneously expose all of these signals with layer aggregation and multi-layer attention mechanisms. In addition, we introduce an auxiliary regularization term to encourage different layers to capture diverse information. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation data demonstrate the effectiveness and universality of the proposed approach.
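To make the auxiliary regularization idea concrete, here is a minimal sketch that penalizes pairwise cosine similarity between layer representations so that different layers are pushed to encode diverse information. The function name `diversity_regularizer`, the cosine-similarity form, and the pair-averaged weighting are illustrative assumptions for this summary, not necessarily the exact regularization term used in the paper.

```python
import torch
import torch.nn.functional as F

def diversity_regularizer(layer_outputs):
    """Illustrative diversity penalty over a list of layer representations.

    layer_outputs: list of tensors, each of shape [batch, seq_len, d_model].
    The returned scalar grows when different layers encode similar content,
    so adding it to the training loss (with a small weight) encourages
    layers to capture diverse information.
    """
    penalty = 0.0
    num_pairs = 0
    for i in range(len(layer_outputs)):
        for j in range(i + 1, len(layer_outputs)):
            # Cosine similarity per position, averaged over batch and sequence.
            sim = F.cosine_similarity(layer_outputs[i], layer_outputs[j], dim=-1)
            penalty = penalty + sim.mean()
            num_pairs += 1
    return penalty / max(num_pairs, 1)

# Usage (hypothetical): loss = translation_loss + reg_weight * diversity_regularizer(encoder_layer_outputs)
```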

Highlights

  • Neural machine translation (NMT) models have advanced the machine translation community in recent years (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014)

  • Current NMT models only leverage the top layers of encoder and decoder in the subsequent process, which misses the opportunity to exploit useful information embedded in other layers

  • Layer Aggregation (Rows 2-5): dense connection and linear combination only marginally improve translation performance, while iterative and hierarchical aggregation strategies achieve more significant improvements of up to +0.99 BLEU points over the baseline model. This indicates that deep aggregations outperform their shallow counterparts by incorporating more depth and sharing, which is consistent with results in computer vision tasks (Yu et al., 2018); a minimal sketch of the linear-combination variant is given after this list

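For concreteness, the sketch below implements the linear-combination aggregation variant mentioned in the highlights: the outputs of all encoder layers are fused with learned, softmax-normalized scalar weights, in the spirit of Peters et al. (2018). The module name `LinearLayerAggregation` and the softmax weighting scheme are assumptions for illustration rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LinearLayerAggregation(nn.Module):
    """Fuse the outputs of all encoder layers with learned scalar weights."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable weight per layer, normalized with softmax at run time.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of [batch, seq_len, d_model] tensors, one per layer.
        stacked = torch.stack(layer_outputs, dim=0)          # [L, batch, seq, d]
        probs = torch.softmax(self.weights, dim=0)           # [L]
        fused = (probs.view(-1, 1, 1, 1) * stacked).sum(0)   # [batch, seq, d]
        return fused
```

Dense connection, iterative aggregation, and hierarchical aggregation differ in how (and how deeply) the layer outputs are combined, but all of them consume the same list of per-layer representations shown here.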

Summary

Introduction

Neural machine translation (NMT) models have advanced the machine translation community in recent years (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014). Nowadays, advanced NMT models generally implement the encoder and decoder as multiple layers, regardless of the specific architecture, such as RNN (Zhou et al., 2016; Wu et al., 2016), CNN (Gehring et al., 2017), or self-attention networks (Vaswani et al., 2017; Chen et al., 2018). The multi-layer encoder and decoder perform the translation task through a series of nonlinear transformations from the representation of the input sequence to the final output sequence. In the natural language processing community, Peters et al. (2018) have shown that simultaneously exposing all layer representations outperforms using only the top layer on transfer learning tasks. Taking the self-attention encoder as an example, each layer $l$ consists of a self-attention sub-layer followed by a position-wise feed-forward sub-layer; the output of the first sub-layer, $C_e^l$, and of the second sub-layer, $H_e^l$, are calculated as

$$C_e^l = \mathrm{LN}\big(\mathrm{SelfAtt}(H_e^{l-1}) + H_e^{l-1}\big), \qquad H_e^l = \mathrm{LN}\big(\mathrm{FFN}(C_e^l) + C_e^l\big),$$

where $\mathrm{SelfAtt}(\cdot)$ denotes multi-head self-attention, $\mathrm{FFN}(\cdot)$ the position-wise feed-forward network, and $\mathrm{LN}(\cdot)$ layer normalization (Vaswani et al., 2017).
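As a minimal sketch of the encoder layer described above, assuming the standard residual-plus-LayerNorm ordering of Vaswani et al. (2017), the PyTorch module below computes $C_e^l$ as the output of the self-attention sub-layer and $H_e^l$ as the output of the feed-forward sub-layer; the hyper-parameter defaults are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer-style encoder layer: self-attention then feed-forward,
    each wrapped with a residual connection and layer normalization."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h_prev):
        # h_prev: output H^{l-1}_e of the previous layer, shape [batch, seq_len, d_model].
        attn_out, _ = self.self_attn(h_prev, h_prev, h_prev)
        c = self.norm1(h_prev + attn_out)   # C^l_e: first sub-layer output
        h = self.norm2(c + self.ffn(c))     # H^l_e: second sub-layer output
        return h
```

Stacking several such layers and collecting every layer's output $H_e^l$ yields the list of representations that the aggregation and multi-layer attention mechanisms described above operate on.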
