Abstract

Most deep neural machine translation (NMT) models are built in a bottom-up feedforward fashion, in which representations in lower layers construct or modulate the representations in higher layers. We conjecture that this unidirectional encoding scheme could be a potential obstacle to building a deep NMT model. In this paper, we propose to build a deeper Transformer encoder by organizing the encoder layers into multiple groups connected via a grouping skip connection mechanism, in which the output of each group is fed into the subsequent groups. In this way, we successfully build a deep Transformer encoder with up to 48 layers. Moreover, we can share the parameters among groups to extend the encoder's (virtual) depth without introducing additional parameters. Extensive experiments on the large-scale WMT (Workshop on Machine Translation) 2014 English-to-German and English-to-French, WMT 2016 English-to-German, and WMT 2017 Chinese-to-English translation tasks demonstrate that our proposed deep Transformer model significantly outperforms the strong Transformer baseline. Furthermore, we carry out linguistic probing tasks to analyze the problems of the original Transformer model and to explain how our deep Transformer encoder improves translation quality. One particularly appealing property of our approach is that it is very easy to implement. We make our code available on GitHub at https://github.com/liyc7711/deep-nmt.
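To make the grouping idea concrete, the following is a minimal PyTorch sketch of an encoder whose layers are partitioned into groups joined by a grouping skip connection, with optional parameter sharing across groups for the "virtual depth" variant. It is an illustration under assumptions, not the authors' implementation (see the GitHub repository for that): in particular, combining earlier group outputs by summation and the class/parameter names (GroupedTransformerEncoder, layers_per_group, num_groups, share_group_params) are hypothetical choices made here for clarity.

```python
# Hypothetical sketch of a grouped Transformer encoder with grouping skip connections.
# Not the authors' code; the exact skip-connection form (summation here) is assumed.
import torch
import torch.nn as nn


class GroupedTransformerEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, layers_per_group=6, num_groups=8,
                 share_group_params=False):
        super().__init__()
        self.num_groups = num_groups
        self.share_group_params = share_group_params
        # With parameter sharing, only one physical group of layers is instantiated
        # and reused, so the (virtual) depth grows without adding parameters.
        num_physical_groups = 1 if share_group_params else num_groups
        self.groups = nn.ModuleList([
            nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048,
                                           batch_first=True)
                for _ in range(layers_per_group)
            ])
            for _ in range(num_physical_groups)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, src_key_padding_mask=None):
        # Treat the embedding output as the first entry in the list of group outputs.
        group_outputs = [x]
        for g in range(self.num_groups):
            group = self.groups[0] if self.share_group_params else self.groups[g]
            # Grouping skip connection (assumed form): combine all earlier group
            # outputs and feed the result into the current group.
            h = torch.stack(group_outputs, dim=0).sum(dim=0)
            for layer in group:
                h = layer(h, src_key_padding_mask=src_key_padding_mask)
            group_outputs.append(h)
        return self.norm(group_outputs[-1])


if __name__ == "__main__":
    # 8 groups of 6 layers each gives the 48-layer configuration from the abstract.
    encoder = GroupedTransformerEncoder(layers_per_group=6, num_groups=8)
    out = encoder(torch.randn(2, 10, 512))
    print(out.shape)  # torch.Size([2, 10, 512])
```

The per-group skip connections expose every group's output to all later groups, so information no longer has to pass through the full stack in a strictly bottom-up, layer-by-layer fashion.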
