Abstract

Recently, model pretraining has been successfully applied to unsupervised and semi-supervised neural machine translation. The cross-lingual language model (XLM) uses a pretrained masked language model to initialize the encoder and decoder of the translation model, which greatly improves translation quality. However, because of a mismatch in the number of layers, the pretrained model can only initialize part of the decoder's parameters. In this paper, we use a layer-wise coordination transformer and a consistent pretraining translation transformer instead of a vanilla transformer as the translation model. The former has only an encoder; the latter has both an encoder and a decoder, but the two share exactly the same parameters. Both models guarantee that every parameter of the translation model can be initialized from the pretrained model. Experiments on Chinese–English and English–German datasets show that, compared with the vanilla transformer baseline, our models achieve better performance with fewer parameters when the parallel corpus is small.
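To make the shared-parameter idea concrete, below is a minimal PyTorch sketch: if the encoder and decoder reuse one layer stack and one token embedding, loading a pretrained masked language model into that stack initializes every translation-model parameter. This is not the authors' implementation; all class, function, and checkpoint-key names are hypothetical, and the attention masking of the actual architecture is omitted.

```python
# Minimal sketch (hypothetical names, simplified masking) of a translation model
# whose encoder and decoder share one parameter stack, so a single pretrained
# masked-LM checkpoint can initialize every parameter.

import torch
import torch.nn as nn


class SharedStackTranslationModel(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        # One token embedding used for both source and target sides,
        # mirroring the shared vocabulary embedding of the pretrained model.
        self.embed = nn.Embedding(vocab_size, d_model)
        # A single stack of self-attention layers reused for both passes.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_stack = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Encoder pass over the source sentence.
        memory = self.shared_stack(self.embed(src_ids))
        # "Decoder" pass: the same parameters process the source memory and the
        # target prefix together (causal masking omitted for brevity).
        dec_in = torch.cat([memory, self.embed(tgt_ids)], dim=1)
        out = self.shared_stack(dec_in)
        return self.proj(out[:, memory.size(1):])


def init_from_pretrained_mlm(model: SharedStackTranslationModel, ckpt_path: str) -> None:
    """Copy a pretrained masked-LM checkpoint (hypothetical key layout) into the
    shared stack; because encoder and decoder are the same module, this single
    load covers all translation parameters."""
    state = torch.load(ckpt_path, map_location="cpu")
    model.embed.load_state_dict({"weight": state["embed.weight"]})
    model.shared_stack.load_state_dict(
        {k.removeprefix("encoder."): v for k, v in state.items()
         if k.startswith("encoder.")})
```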

Highlights

  • Neural machine translation (NMT), which is trained in an end-to-end fashion [1,2,3,4], has become the mainstream approach to machine translation and has even reached human-level quality in some domains [5,6,7].

  • In order to solve these problems, we propose a new transformer variant based on the vanilla transformer and the layer-wise coordination transformer, called the consistent pretraining translation transformer (CPTT).

  • The pretrained model shares token embeddings between the source and target languages, whereas the vanilla transformer NMT model in XLM does not share token embeddings between its encoder and decoder (see the sketch after this list).
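As a concrete illustration of that mismatch, the following sketch (hypothetical names, not taken from XLM's code) ties a single token-embedding matrix across the encoder input, the decoder input, and the output projection, so the translation model matches the pretrained model's shared vocabulary embedding.

```python
# Minimal sketch of tying one token-embedding matrix across encoder, decoder,
# and output projection (assumed sizes; not the paper's actual code).

import torch.nn as nn

vocab_size, d_model = 32000, 512

shared_embed = nn.Embedding(vocab_size, d_model)

encoder_embed = shared_embed               # source side reuses the same weights
decoder_embed = shared_embed               # target side reuses the same weights

output_proj = nn.Linear(d_model, vocab_size, bias=False)
output_proj.weight = shared_embed.weight   # tie the softmax projection as well
```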


Summary

Introduction

Neural machine translation (NMT), which is trained in an end-to-end fashion [1,2,3,4], has become the mainstream approach to machine translation and has even reached human-level quality in some domains [5,6,7]. For low-resource semi-supervised neural machine translation, XLM first trains a transformer encoder on both source and target monolingual data through masked language modeling, and the pretrained model is then used to initialize the encoder and decoder of the translation transformer. We keep masked language modeling as the pretraining task, but we use two transformer variants instead of the vanilla transformer as the translation model: one is the layer-wise coordination transformer [20], and the other is the consistent pretraining translation transformer. Our main contributions are: (1) to keep the model consistent between pretraining and translation, we propose to use the layer-wise coordination transformer in place of the vanilla transformer as the translation model; and (2) based on the vanilla transformer and the layer-wise coordination transformer, we propose the consistent pretraining translation transformer, which achieves better performance in the pretraining and fine-tuning setting.
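For readers unfamiliar with the pretraining task, here is a brief sketch of a masked language modeling objective under assumed details (token ids, masking rate). XLM's full recipe also includes additional replacement strategies (e.g., keeping or randomizing some selected tokens), which are omitted here.

```python
# Sketch of a masked language modeling loss: a fraction of tokens is replaced by
# a [MASK] id and the model is trained to recover the original tokens there.
# MASK_ID, PAD_ID, and MASK_PROB are assumed values, not taken from the paper.

import torch
import torch.nn.functional as F

MASK_ID = 4          # hypothetical id of the [MASK] token
PAD_ID = 0           # hypothetical padding id
MASK_PROB = 0.15     # typical masking rate


def mask_tokens(token_ids: torch.Tensor):
    """Return (masked inputs, labels); labels are -100 where no prediction is asked."""
    labels = token_ids.clone()
    # Choose ~15% of non-padding positions to mask.
    probs = torch.full(token_ids.shape, MASK_PROB)
    chosen = torch.bernoulli(probs).bool() & (token_ids != PAD_ID)
    labels[~chosen] = -100                      # ignored by the loss
    inputs = token_ids.clone()
    inputs[chosen] = MASK_ID                    # replace chosen tokens with [MASK]
    return inputs, labels


def mlm_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    inputs, labels = mask_tokens(token_ids)
    logits = model(inputs)                      # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```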

Related Works
Background
Transformer-Based NMT
Our Models
Layer-Wise Coordination Transformer
Consistent Pretraining Translation Transformer
Other Model Details
Datasets and Preprocessing
Model Configurations
Results and Analysis
Ablation Study
The Influence of Parallel Corpus Size
Conclusions