Abstract

In this paper, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is applied to Transformer-based neural machine translation (NMT). In contrast to monolingual tasks, the number of unlearned parameters in an NMT decoder is as large as the number of learned parameters in the BERT model. To train all the sub-models appropriately, we employ two-stage optimization, which first trains only the unlearned parameters while freezing the BERT model, and then fine-tunes all the sub-models. In our experiments, two-stage optimization was stable, whereas the BLEU scores of direct fine-tuning were extremely low. Consequently, the BLEU scores of the proposed method were better than those of the Transformer base model and of the same model without pre-training. Additionally, we confirmed that NMT with the BERT encoder is more effective in low-resource settings.
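A minimal PyTorch-style sketch of this two-stage schedule is shown below. The toy encoder and decoder modules, learning rates, and step counts are illustrative assumptions, not the authors' implementation; only the freeze-then-unfreeze schedule reflects the method described above.

```python
# Sketch of two-stage optimization: (1) freeze the pre-trained encoder and train only
# the newly initialized decoder parameters; (2) unfreeze everything and fine-tune jointly.
# The modules and hyperparameters here are toy stand-ins, chosen only for illustration.
import torch
import torch.nn as nn

class ToyNMT(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        # Stand-in for the pre-trained BERT encoder.
        self.encoder = nn.Sequential(
            nn.Embedding(vocab, dim),
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
        )
        # Stand-in for the randomly initialized NMT decoder stack.
        self.decoder = nn.Linear(dim, vocab)

    def forward(self, src):
        return self.decoder(self.encoder(src))

def run_stage(model, optimizer, steps=3):
    # Ordinary training loop with dummy data; in practice this is cross-entropy
    # over target tokens of a parallel corpus.
    src = torch.randint(0, 1000, (8, 16))
    tgt = torch.randint(0, 1000, (8, 16))
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(src)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
        loss.backward()
        optimizer.step()

model = ToyNMT()

# Stage 1: freeze the pre-trained encoder, train only the unlearned decoder parameters.
for p in model.encoder.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
run_stage(model, stage1_opt)

# Stage 2: unfreeze everything and fine-tune all sub-models jointly.
for p in model.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.Adam(model.parameters(), lr=2e-5)
run_stage(model, stage2_opt)
```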

Highlights

  • Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) is a language representation model trained in advance on a very large monolingual dataset

  • Considering that the improvement on newstest2015 in the experiment in Table 3 was +1.41 under the same settings, these results show that the BERT encoder is more effective for improving translation quality in a low-resource setting

  • Two-stage optimization, that is, decoder training followed by fine-tuning, was necessary because the number of unlearned parameters was as large as the number of pre-trained model parameters


Summary

Introduction

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) is a language representation model trained in advance on a very large monolingual dataset. We adapt this model to our own tasks by fine-tuning it (Freitag and Al-Onaizan, 2016; Servan et al., 2016) on task-specific data. Models in which the ideas of BERT are extended to multiple languages have also been proposed (Lample and Conneau, 2019). These models, which are pre-trained using multilingual data, are called cross-lingual language models (XLMs). We can construct a machine translation system using two XLM models as the encoder and decoder.
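As an illustration of this encoder-decoder idea (not code from the paper), the sketch below combines two pre-trained checkpoints into a sequence-to-sequence model using the Hugging Face Transformers library; the multilingual BERT checkpoint is an assumed stand-in for the XLM models, and the decoder's cross-attention weights start untrained, which is exactly the large block of unlearned parameters that motivates the two-stage optimization above.

```python
# Illustrative sketch (assumption: Hugging Face Transformers is available; the paper
# does not use this library). Two pre-trained checkpoints form an encoder-decoder model;
# "bert-base-multilingual-cased" is a stand-in for the cross-lingual (XLM) checkpoints.
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Encoder and decoder both start from pre-trained weights, but the decoder's
# cross-attention layers are newly initialized and must be learned on parallel data.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)

# Special-token settings required before training the model on parallel data.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```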
