Abstract

Although future context is widely regarded as useful for word prediction in machine translation, it is difficult in practice to incorporate into neural machine translation. In this paper, we propose a future-aware knowledge distillation framework (FKD) to address this issue. In the FKD framework, we learn to distill future knowledge from a backward neural language model (teacher) into future-aware vectors (student) during the training phase. The future-aware vector for each word position is computed by a bridge network and optimized towards the corresponding hidden state of the backward neural language model via a knowledge distillation mechanism. We further propose an algorithm to jointly train the neural machine translation model, the neural language model and the knowledge distillation module end-to-end. The learned future-aware vectors are incorporated into the attention layer of the decoder to provide full-range context information during the decoding phase. Experiments on the NIST Chinese-English and WMT English-German translation tasks show that the proposed method significantly improves translation quality and word alignment.
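The distillation step described above can be pictured as follows: a bridge network produces a future-aware vector for each target position, and a distillation loss pulls that vector toward the corresponding hidden state of the backward language model, while the NMT and language-model objectives are trained jointly. The Python sketch below is only a minimal illustration of this idea under stated assumptions; the names (BridgeNetwork, distill_loss), the MSE form of the distillation objective, and the loss weighting are assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    # Sketch of the distillation idea, not the paper's actual code.
    # Assumed: an L2 (MSE) distillation objective and a simple feed-forward bridge.

    class BridgeNetwork(nn.Module):
        """Maps decoder-side hidden states to future-aware vectors (student)."""
        def __init__(self, hidden_size):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.Tanh(),
            )

        def forward(self, decoder_states):
            # decoder_states: (batch, tgt_len, hidden_size)
            return self.proj(decoder_states)

    def distill_loss(future_aware_vectors, backward_lm_states):
        # Teacher targets: hidden states of a backward (right-to-left) language
        # model over the target sentence; detached so no gradient reaches the teacher.
        return nn.functional.mse_loss(future_aware_vectors,
                                      backward_lm_states.detach())

    # Joint training sketch: the total loss combines the usual NMT cross-entropy,
    # the backward LM objective, and the distillation term (lambda_kd is assumed):
    # loss = nmt_loss + lm_loss + lambda_kd * distill_loss(student_vecs, teacher_states)

At decoding time the backward language model is no longer needed: the future-aware vectors produced by the bridge network stand in for the future context and are fed to the decoder's attention layer, as the abstract describes.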
