Abstract
Existing knowledge distillation methods rely on large amounts of bilingual data and focus on mining the knowledge distribution that maps the source language to the target language. However, for some language pairs, bilingual data is scarce. In this paper, to make better use of both monolingual and limited bilingual data, we propose a new knowledge distillation method called Dual Knowledge Distillation (DKD). For monolingual data, we use a self-distillation strategy that combines self-training and knowledge distillation at the encoder to extract more consistent monolingual representations. For bilingual data, on top of the k-Nearest-Neighbor Knowledge Distillation (kNN-KD) method, a similar self-distillation strategy is adopted as a consistency regularizer that forces the decoder to produce consistent outputs. Experiments on standard datasets, multi-domain translation datasets, and low-resource datasets show that DKD achieves consistent improvements over state-of-the-art baselines, including kNN-KD.
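To make the two branches described above concrete, the following is a minimal PyTorch sketch of how such an objective could be assembled: an encoder-side self-distillation term that enforces consistency between two stochastic passes over monolingual source text, and a decoder-side term that combines standard cross-entropy, a kNN-KD term distilling a retrieval-based distribution, and an output-consistency regularizer. The function names, tensor shapes, and loss weights are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def encoder_self_distillation(enc_out_a: torch.Tensor,
                              enc_out_b: torch.Tensor) -> torch.Tensor:
    """Consistency between two encoder passes over the same monolingual source.
    The second pass acts as the 'teacher' and is detached from the graph."""
    return F.mse_loss(enc_out_a, enc_out_b.detach())


def decoder_dual_loss(logits_a: torch.Tensor,
                      logits_b: torch.Tensor,
                      knn_probs: torch.Tensor,
                      target: torch.Tensor,
                      tau: float = 1.0,
                      alpha: float = 0.5,
                      beta: float = 0.5) -> torch.Tensor:
    """Bilingual branch: cross-entropy + kNN-KD + decoder output consistency.
    `knn_probs` stands for the retrieval-based distribution from a kNN datastore
    (as in kNN-KD); `logits_a` / `logits_b` are two stochastic decoder passes."""
    log_p_a = F.log_softmax(logits_a / tau, dim=-1)
    p_b = F.softmax(logits_b.detach() / tau, dim=-1)

    ce = F.cross_entropy(logits_a.transpose(1, 2), target)        # standard NMT loss
    knn_kd = F.kl_div(log_p_a, knn_probs, reduction="batchmean")  # distill the kNN distribution
    consistency = F.kl_div(log_p_a, p_b, reduction="batchmean")   # self-distillation regularizer
    return ce + alpha * knn_kd + beta * consistency


# Toy shapes for illustration: batch=2, seq=5, vocab=100, hidden=64.
enc_a = torch.randn(2, 5, 64, requires_grad=True)
enc_b = torch.randn(2, 5, 64)
logits_a = torch.randn(2, 5, 100, requires_grad=True)
logits_b = torch.randn(2, 5, 100)
knn_probs = torch.softmax(torch.randn(2, 5, 100), dim=-1)
target = torch.randint(0, 100, (2, 5))

loss = encoder_self_distillation(enc_a, enc_b) + decoder_dual_loss(
    logits_a, logits_b, knn_probs, target)
loss.backward()
```

In practice the two passes would come from the same model with dropout enabled, so the consistency terms act as regularizers rather than requiring a separate teacher network; the weighting between the monolingual and bilingual terms is left as a hyperparameter here.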