Abstract

While non-autoregressive (NAR) models show great promise for machine translation, their use is limited by their dependence on knowledge distillation from autoregressive models. To address this issue, we seek to understand why distillation is so effective. Prior work suggests that distilled training data is less complex than manual translations. Based on experiments with the Levenshtein Transformer and the Mask-Predict NAR models on the WMT14 German-English task, this paper shows that different types of complexity have different impacts: reducing lexical diversity and decreasing reordering complexity both help NAR models learn better alignment between source and target, and thus improve translation quality; however, reduced lexical diversity is the main reason distillation increases model confidence, which affects the calibration of the two NAR models differently.
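The abstract distinguishes model confidence from calibration, i.e., how well predicted probabilities match actual accuracy. As a reminder of how calibration is typically quantified, the sketch below computes a standard token-level expected calibration error (ECE); it is a generic illustration rather than the paper's evaluation code, and the bin count and toy inputs are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Token-level ECE: average |accuracy - confidence| over equal-width bins,
    weighted by the fraction of tokens falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: per-token max probabilities and whether the argmax token was correct.
conf = [0.95, 0.80, 0.60, 0.99, 0.70, 0.55]
hits = [1,    1,    0,    1,    1,    0]
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")
```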

Highlights

  • When training NAR models for neural machine translation (NMT), sequence-level knowledge distillation (Kim and Rush, 2016) is key to matching the translation quality of autoregressive (AR) models (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019).

  • Sequence-level knowledge distillation (SLKD) trains the student model $p(y \mid x)$ to approximate the teacher distribution $q(y \mid x)$ by minimizing the following loss: $\mathcal{L}_{\text{SEQ-KD}} = -\sum_{y \in \mathcal{Y}} q(y \mid x)\,\log p(y \mid x) \approx -\sum_{y \in \mathcal{Y}} \mathbb{1}[y = \hat{y}]\,\log p(y \mid x)$, where $\mathcal{Y}$ is the space of all possible target sequences and $\hat{y}$ is the output of running beam search with the teacher model $q$ (a toy numerical illustration of this objective appears after this list).

  • Training samples can be complex in different ways, and it remains unclear how different types of data complexity alter the internal workings of NAR models and their translation quality.
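To make the SLKD objective concrete, here is a toy numerical illustration: it compares the exact loss, computed over a tiny enumerable target space, with the one-sample approximation that replaces the teacher distribution by a point mass on its highest-scoring output. The three candidate "translations" and the probability values are made up for illustration; in practice $\mathcal{Y}$ is the space of full target sequences and $\hat{y}$ comes from beam search.

```python
import math

# Toy target space Y for a fixed source x: just three candidate translations.
Y = ["a", "b", "c"]

# Hypothetical teacher q(y|x) and student p(y|x) distributions over Y.
q = {"a": 0.7, "b": 0.2, "c": 0.1}
p = {"a": 0.5, "b": 0.3, "c": 0.2}

# Exact sequence-level KD loss: -sum_y q(y|x) log p(y|x).
exact = -sum(q[y] * math.log(p[y]) for y in Y)

# Approximation used in practice: replace q with a point mass on the
# teacher's beam-search output y_hat (here simply the teacher's mode).
y_hat = max(Y, key=lambda y: q[y])
approx = -math.log(p[y_hat])

print(f"exact  L_SEQ-KD = {exact:.4f}")
print(f"approx L_SEQ-KD = {approx:.4f}  (cross-entropy on teacher output '{y_hat}')")
```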


Summary

Introduction and Background

When training NAR models for neural machine translation (NMT), sequence-level knowledge distillation (Kim and Rush, 2016) is key to matching the translation quality of autoregressive (AR) models (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019). Gu et al. (2018) hypothesize that SLKD reduces the number of modes in the output distribution (alternative translations for a source). This hypothesis was supported by experiments that use multiway parallel data to simulate the modes (Zhou et al., 2019). Zhou et al. (2019) investigate the impact of data complexity on NAR translation quality: they generate distilled data of varying complexity with AR models of different capacity and show that higher-capacity NAR models require more complex distilled data to achieve better translation quality. They further show that generating distilled references with a mixture of experts (Shen et al., 2019) improves NAR translation quality. Experiments show that decreasing reordering complexity and reducing lexical diversity via distillation both help NAR models learn better alignment between source and target and improve translation quality.
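The summary treats reordering complexity and lexical diversity as two distinct properties of the training data. The sketch below shows one simple way such properties could be measured from word-aligned sentence pairs: lexical diversity as the average conditional entropy of target words given their aligned source words, and reordering as the fraction of crossing alignment links. These particular metrics and the toy corpus are assumptions for illustration and are not necessarily the exact measures used in the paper.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Toy word-aligned corpus: (source tokens, target tokens, alignment links as
# (src_index, tgt_index) pairs). Real experiments would derive alignments
# with a tool such as fast_align.
corpus = [
    (["das", "haus"], ["the", "house"], [(0, 0), (1, 1)]),
    (["das", "auto"], ["that", "car"],  [(0, 0), (1, 1)]),
    (["haus", "das"], ["the", "house"], [(0, 1), (1, 0)]),
]

def lexical_diversity(corpus):
    """Average conditional entropy H(target word | aligned source word), in bits.
    Lower values mean each source word maps to fewer target choices."""
    translations = defaultdict(Counter)
    for src, tgt, links in corpus:
        for i, j in links:
            translations[src[i]][tgt[j]] += 1
    entropies = []
    for counts in translations.values():
        total = sum(counts.values())
        h = -sum(c / total * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)

def crossing_rate(corpus):
    """Fraction of alignment-link pairs that cross; a simple reordering measure."""
    crossings, pairs = 0, 0
    for _, _, links in corpus:
        for (i1, j1), (i2, j2) in combinations(links, 2):
            pairs += 1
            if (i1 - i2) * (j1 - j2) < 0:
                crossings += 1
    return crossings / pairs if pairs else 0.0

print(f"lexical diversity (bits): {lexical_diversity(corpus):.3f}")
print(f"crossing rate:            {crossing_rate(corpus):.3f}")
```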

Generating Diverse Distilled References
Experimental Settings
Preliminary
Reduced Lexical Diversity in SLKD Improves Translation Quality
SLKD Increases Confidence of Source-Target Attention
Reduced Lexical Diversity in SLKD Improves Model Confidence
Conclusion
A Data Preprocessing Details
B Model and Training Details
D Reference Generation Examples
