Pre-trained language models, such as Bidirectional Encoder Representations from Transformers (BERT), have demonstrated state-of-the-art performance on many Natural Language Processing (NLP) downstream tasks. Incorporating pre-trained BERT knowledge into the Sequence-to-Sequence (Seq2Seq) model can significantly enhance machine translation performance, particularly for low-resource language pairs. However, most previous studies fine-tune both the large pre-trained BERT model and the Seq2Seq model jointly, which leads to costly training, especially when parallel data are limited. Consequently, the integration of pre-trained BERT contextual representations into the Seq2Seq framework remains limited. In this paper, we propose a simple and effective BERT knowledge fusion approach based on regularized Mixup for low-resource Neural Machine Translation (NMT), referred to as ReMixup-NMT, which constrains the distributions of the normal Transformer encoder and the Mixup-based Transformer encoder to be consistent. The proposed ReMixup-NMT approach distills and fuses pre-trained BERT knowledge into the Seq2Seq NMT architecture efficiently, without training any additional parameters. Experimental results on six low-resource NMT tasks show that the proposed approach outperforms state-of-the-art (SOTA) BERT-fused and drop-based methods on the IWSLT’15 English→Vietnamese and IWSLT’17 English→French datasets.
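
The sketch below illustrates, in PyTorch, one way the consistency idea in the abstract could be realized. It is not the paper's exact recipe: the interpolation of BERT states with the NMT encoder's own embeddings, the Beta-sampled mixing ratio, and the use of a symmetric KL penalty between the predictions obtained from the normal pass and the Mixup-based pass are all assumptions introduced here for illustration.

```python
# Illustrative sketch only; function names and the specific divergence are assumptions.
import torch
import torch.nn.functional as F
from torch.distributions import Beta


def mixup_bert_embeddings(bert_states: torch.Tensor,
                          nmt_embeddings: torch.Tensor,
                          beta_alpha: float = 0.5) -> torch.Tensor:
    """Interpolate pre-trained BERT representations with the NMT encoder's
    token embeddings (assumes both are already projected to the same size)."""
    lam = Beta(beta_alpha, beta_alpha).sample().to(bert_states.device)
    return lam * bert_states + (1.0 - lam) * nmt_embeddings


def remixup_loss(logits_plain: torch.Tensor,
                 logits_mixup: torch.Tensor,
                 target: torch.Tensor,
                 pad_id: int,
                 alpha: float = 1.0) -> torch.Tensor:
    """Translation loss for both encoder passes plus a symmetric KL term
    that pushes their output distributions to stay consistent."""
    # Standard NLL for the normal pass and the Mixup-based pass.
    nll_plain = F.cross_entropy(logits_plain.transpose(1, 2), target,
                                ignore_index=pad_id)
    nll_mixup = F.cross_entropy(logits_mixup.transpose(1, 2), target,
                                ignore_index=pad_id)
    # Symmetric KL between the two output distributions (consistency term).
    p = F.log_softmax(logits_plain, dim=-1)
    q = F.log_softmax(logits_mixup, dim=-1)
    kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return nll_plain + nll_mixup + alpha * kl
```

Because the consistency term only couples two forward passes of the same model, a setup like this adds no trainable parameters beyond the Seq2Seq model itself, which matches the efficiency claim in the abstract.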