Abstract

Although future context is widely regarded as useful for word prediction in machine translation, it is difficult in practice to incorporate into neural machine translation. In this paper, we propose a future-aware knowledge distillation framework (FKD) to address this issue. In the FKD framework, we learn to distill future knowledge from a backward neural language model (teacher) into future-aware vectors (student) during the training phase. The future-aware vector for each word position is computed by a bridge network and optimized towards the corresponding hidden state of the backward neural language model via a knowledge distillation mechanism. We further propose an algorithm to jointly train the neural machine translation model, the neural language model and the knowledge distillation module end-to-end. The learned future-aware vectors are incorporated into the attention layer of the decoder to provide full-range context information during the decoding phase. Experiments on the NIST Chinese-English and WMT English-German translation tasks show that the proposed method significantly improves translation quality and word alignment.
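The distillation step described above can be pictured as follows: a bridge network produces a future-aware vector for each target position, and a distillation loss pulls that vector toward the corresponding hidden state of the backward language model, while the NMT and language-model objectives are trained jointly. The Python sketch below is only a minimal illustration of this idea under stated assumptions; the names (BridgeNetwork, distill_loss), the MSE form of the distillation objective, and the loss weighting are assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    # Sketch of the distillation idea, not the paper's actual code.
    # Assumed: an L2 (MSE) distillation objective and a simple feed-forward bridge.

    class BridgeNetwork(nn.Module):
        """Maps decoder-side hidden states to future-aware vectors (student)."""
        def __init__(self, hidden_size):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.Tanh(),
            )

        def forward(self, decoder_states):
            # decoder_states: (batch, tgt_len, hidden_size)
            return self.proj(decoder_states)

    def distill_loss(future_aware_vectors, backward_lm_states):
        # Teacher targets: hidden states of a backward (right-to-left) language
        # model over the target sentence; detached so no gradient reaches the teacher.
        return nn.functional.mse_loss(future_aware_vectors,
                                      backward_lm_states.detach())

    # Joint training sketch: the total loss combines the usual NMT cross-entropy,
    # the backward LM objective, and the distillation term (lambda_kd is assumed):
    # loss = nmt_loss + lm_loss + lambda_kd * distill_loss(student_vecs, teacher_states)

At decoding time the backward language model is no longer needed: the future-aware vectors produced by the bridge network stand in for the future context and are fed to the decoder's attention layer, as the abstract describes.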
