Abstract

While non-autoregressive (NAR) models show great promise for machine translation, their use is limited by their dependence on knowledge distillation from autoregressive models. To address this issue, we seek to understand why distillation is so effective. Prior work suggests that distilled training data is less complex than manual translations. Based on experiments with the Levenshtein Transformer and the Mask-Predict NAR models on the WMT14 German-English task, this paper shows that different types of complexity have different impacts: reducing lexical diversity and decreasing reordering complexity both help NAR models learn better alignment between source and target, and thus improve translation quality; however, reduced lexical diversity is the main reason distillation increases model confidence, which affects the calibration of the two NAR models differently.
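The abstract distinguishes model confidence from calibration, i.e., how well predicted probabilities match actual accuracy. As a reminder of how calibration is typically quantified, the sketch below computes a standard token-level expected calibration error (ECE); it is a generic illustration rather than the paper's evaluation code, and the bin count and toy inputs are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Token-level ECE: average |accuracy - confidence| over equal-width bins,
    weighted by the fraction of tokens falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: per-token max probabilities and whether the argmax token was correct.
conf = [0.95, 0.80, 0.60, 0.99, 0.70, 0.55]
hits = [1,    1,    0,    1,    1,    0]
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")
```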

Highlights

  • When training NAR models for neural machine translation (NMT), sequence-level knowledge distillation (Kim and Rush, 2016) is key to matching the translation quality of autoregressive (AR) models (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019).

  • Sequence-level knowledge distillation (SLKD) trains the student model $p(y \mid x)$ to approximate the teacher distribution $q(y \mid x)$ by minimizing the following loss: $\mathcal{L}_{\text{SEQ-KD}} = -\sum_{y \in \mathcal{Y}} q(y \mid x)\,\log p(y \mid x) \approx -\sum_{y \in \mathcal{Y}} \mathbb{1}[y = \hat{y}]\,\log p(y \mid x)$, where $\mathcal{Y}$ is the space of all possible target sequences and $\hat{y}$ is the output of running beam search with the teacher model $q$ (a toy numerical illustration of this objective appears after this list).

  • Training samples can be complex in different ways, and it remains unclear how different types of data complexity alter the internal workings of NAR models and their translation quality.
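To make the SLKD objective concrete, here is a toy numerical illustration: it compares the exact loss, computed over a tiny enumerable target space, with the one-sample approximation that replaces the teacher distribution by a point mass on its highest-scoring output. The three candidate "translations" and the probability values are made up for illustration; in practice $\mathcal{Y}$ is the space of full target sequences and $\hat{y}$ comes from beam search.

```python
import math

# Toy target space Y for a fixed source x: just three candidate translations.
Y = ["a", "b", "c"]

# Hypothetical teacher q(y|x) and student p(y|x) distributions over Y.
q = {"a": 0.7, "b": 0.2, "c": 0.1}
p = {"a": 0.5, "b": 0.3, "c": 0.2}

# Exact sequence-level KD loss: -sum_y q(y|x) log p(y|x).
exact = -sum(q[y] * math.log(p[y]) for y in Y)

# Approximation used in practice: replace q with a point mass on the
# teacher's beam-search output y_hat (here simply the teacher's mode).
y_hat = max(Y, key=lambda y: q[y])
approx = -math.log(p[y_hat])

print(f"exact  L_SEQ-KD = {exact:.4f}")
print(f"approx L_SEQ-KD = {approx:.4f}  (cross-entropy on teacher output '{y_hat}')")
```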


Summary

Introduction and Background

When training NAR models for neural machine translation (NMT), sequence-level knowledge distillation (Kim and Rush, 2016) is key to matching the translation quality of autoregressive (AR) models (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019). Gu et al. (2018) hypothesize that SLKD reduces the number of modes in the output distribution (alternative translations for a source). This hypothesis was supported by experiments that use multiway parallel data to simulate the modes (Zhou et al., 2019). Zhou et al. (2019) investigate the impact of data complexity on NAR translation quality: they generate distilled data of varying complexity with AR models of different capacity and show that higher-capacity NAR models require more complex distilled data to achieve better translation quality. They further show that generating distilled references with a mixture of experts (Shen et al., 2019) improves NAR translation quality. Experiments show that decreasing reordering complexity and reducing lexical diversity via distillation both help NAR models learn better alignment between source and target and improve translation quality.
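The summary treats reordering complexity and lexical diversity as two distinct properties of the training data. The sketch below shows one simple way such properties could be measured from word-aligned sentence pairs: lexical diversity as the average conditional entropy of target words given their aligned source words, and reordering as the fraction of crossing alignment links. These particular metrics and the toy corpus are assumptions for illustration and are not necessarily the exact measures used in the paper.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Toy word-aligned corpus: (source tokens, target tokens, alignment links as
# (src_index, tgt_index) pairs). Real experiments would derive alignments
# with a tool such as fast_align.
corpus = [
    (["das", "haus"], ["the", "house"], [(0, 0), (1, 1)]),
    (["das", "auto"], ["that", "car"],  [(0, 0), (1, 1)]),
    (["haus", "das"], ["the", "house"], [(0, 1), (1, 0)]),
]

def lexical_diversity(corpus):
    """Average conditional entropy H(target word | aligned source word), in bits.
    Lower values mean each source word maps to fewer target choices."""
    translations = defaultdict(Counter)
    for src, tgt, links in corpus:
        for i, j in links:
            translations[src[i]][tgt[j]] += 1
    entropies = []
    for counts in translations.values():
        total = sum(counts.values())
        h = -sum(c / total * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)

def crossing_rate(corpus):
    """Fraction of alignment-link pairs that cross; a simple reordering measure."""
    crossings, pairs = 0, 0
    for _, _, links in corpus:
        for (i1, j1), (i2, j2) in combinations(links, 2):
            pairs += 1
            if (i1 - i2) * (j1 - j2) < 0:
                crossings += 1
    return crossings / pairs if pairs else 0.0

print(f"lexical diversity (bits): {lexical_diversity(corpus):.3f}")
print(f"crossing rate:            {crossing_rate(corpus):.3f}")
```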

Generating Diverse Distilled References
Experimental Settings
Preliminary
Reduced Lexical Diversity in SLKD Improves Translation Quality
SLKD Increases Confidence of Source-Target Attention
Reduced Lexical Diversity in SLKD Improves Model Confidence
Conclusion
A Data Preprocessing Details
B Model and Training Details
D Reference Generation Examples
