Abstract

Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel, resulting in faster generation than their autoregressive (AR) counterparts but at the cost of lower accuracy. Different techniques, including knowledge distillation and source-target alignment, have been proposed to bridge the gap between AR and NAR models in various tasks such as neural machine translation (NMT), automatic speech recognition (ASR), and text to speech (TTS). With the help of these techniques, NAR models can catch up with the accuracy of AR models in some tasks but not in others. In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why can NAR models catch up with AR models in some tasks but not all? (2) Why do techniques like knowledge distillation and source-target alignment help NAR models? Since the main difference between AR and NAR models is that NAR models do not use dependency among target tokens while AR models do, intuitively the difficulty of NAR sequence generation heavily depends on the strength of the dependency among target tokens. To quantify such dependency, we propose an analysis model called CoMMA to characterize the difficulty of different NAR sequence generation tasks. We have several interesting findings: (1) Among the NMT, ASR and TTS tasks, ASR has the most target-token dependency while TTS has the least. (2) Knowledge distillation reduces the target-token dependency in the target sequence and thus improves the accuracy of NAR models. (3) The source-target alignment constraint encourages dependency of a target token on source tokens and thus eases the training of NAR models.

Highlights

  • We find that R(p) in neural machine translation (NMT) decreases more quickly than in the other two tasks, which indicates that NMT is good at learning from the source context when less context information can be leveraged from the target side, while R(p) in automatic speech recognition (ASR) decreases little

  • It can be seen that knowledge distillation can boost the accuracy of NAR models in NMT and text to speech (TTS), which is consistent with previous works

  • We conducted a comprehensive study on NAR models in the NMT, ASR and TTS tasks to analyze several research questions, including the difficulty of NAR generation and why knowledge distillation and alignment constraints can help NAR models

Summary

Introduction

Non-autoregressive (NAR) models (Oord et al., 2017; Gu et al., 2017; Chen et al., 2019; Ren et al., 2019), which generate all the tokens in a target sequence in parallel and can thus speed up inference, are widely explored in natural language and speech processing tasks such as neural machine translation (NMT) (Gu et al., 2017; Lee et al., 2018; Guo et al., 2019a; Wang et al., 2019; Li et al., 2019b; Guo et al., 2019b), automatic speech recognition (ASR) (Chen et al., 2019) and text to speech (TTS) synthesis (Oord et al., 2017; Ren et al., 2019). To better understand NAR sequence generation and answer the above questions, we need to characterize and quantify the target-token dependency, which turns out to be non-trivial since the sequences can be of different modalities (i.e., speech or text). For this purpose, we design a novel model called COnditional Masked prediction model with MixAttention (CoMMA), inspired by the mix-attention in He et al. (2018) and the masked language modeling in Devlin et al. (2018): in CoMMA, (1) the prediction of one target token can attend to all the source and target tokens with mix-attention, and (2) target tokens are randomly masked with varying probabilities. CoMMA helps us measure target-token dependency as the ratio of the attention weights on the target context over those on the full (both source and target) context when predicting a target token: the bigger the ratio, the larger the dependency among target tokens.
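
To make this measure concrete, the following is a minimal sketch (not the authors' implementation) of how such a ratio could be computed from single-head attention weights over the concatenated source and target context; the function names (mix_attention, dependency_ratio) and the toy dimensions are purely illustrative assumptions.

    # Minimal sketch of a CoMMA-style target-token dependency ratio.
    # Assumption: a single attention head over [source; target] context.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def mix_attention(query, source_keys, target_keys):
        """Attend over the concatenation of source and target context and
        return the attention weights split by side (source vs. target)."""
        keys = np.concatenate([source_keys, target_keys], axis=0)
        scores = query @ keys.T / np.sqrt(query.shape[-1])
        weights = softmax(scores, axis=-1)
        n_src = source_keys.shape[0]
        return weights[..., :n_src], weights[..., n_src:]

    def dependency_ratio(query, source_keys, target_keys):
        """Attention mass on the target context over the full context;
        a larger ratio indicates stronger dependency among target tokens."""
        w_src, w_tgt = mix_attention(query, source_keys, target_keys)
        return w_tgt.sum() / (w_src.sum() + w_tgt.sum())

    # Toy example: one query token, 4 source tokens, 3 unmasked target tokens.
    rng = np.random.default_rng(0)
    d = 8
    q = rng.normal(size=(1, d))
    src = rng.normal(size=(4, d))
    tgt = rng.normal(size=(3, d))
    print(f"target-token dependency ratio: {dependency_ratio(q, src, tgt):.3f}")

In the paper's analysis, a ratio of this kind is presumably averaged over tokens, heads and layers and tracked as the target masking probability p is varied, yielding the R(p) curves referred to in the highlights above; the exact averaging details are not given in this excerpt and are an assumption here.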
