Abstract
The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of slower inference, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method score 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times compared to the teacher model.
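The abstract only names the ingredients (knowledge distillation, an imitation learning perspective, exposure bias) without detailing the algorithm, so the following is a minimal sketch of how a DAgger-style view of distillation is commonly realized, not the paper's actual implementation: the student generates its own prefixes, so it is trained on the states it will actually visit at inference time, and the teacher acts as an oracle supplying target next-token distributions at those states. The model interface `student(src, prefix)` returning per-position logits, the greedy rollout, and the KL loss are all illustrative assumptions.

```python
# Hedged sketch of imitation-learning-style distillation for an autoregressive
# seq2seq student; interfaces and shapes are assumed, not taken from the paper.
import torch
import torch.nn.functional as F

def imitation_distill_step(student, teacher, src, max_len, bos_id, optimizer):
    """One DAgger-style distillation update on a batch of source sequences."""
    student.train()
    teacher.eval()

    # 1) Roll out the student to collect the prefixes it would visit at test time
    #    (this is what mitigates exposure bias relative to teacher forcing).
    prefix = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    with torch.no_grad():
        for _ in range(max_len - 1):
            logits = student(src, prefix)             # (batch, t, vocab); assumed signature
            next_tok = logits[:, -1].argmax(dim=-1)   # greedy rollout; sampling also works
            prefix = torch.cat([prefix, next_tok.unsqueeze(1)], dim=1)

    # 2) Query the teacher oracle on the student-visited prefixes.
    with torch.no_grad():
        teacher_logits = teacher(src, prefix)         # teacher's next-token distributions

    # 3) Train the student to imitate the teacher at every visited state (KL loss).
    student_logits = student(src, prefix)
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```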
Highlights
Autoregressive models are ubiquitous in natural language processing.
Due to the sequential nature of text generation, they are often the tool of choice for tackling sequence-to-sequence problems such as translation (Sutskever et al., 2014), summarization (Rush et al., 2015), and dialogue (Eric and Manning, 2017).
Two recent trends have made autoregressive models cumbersome to deploy in real-world natural language generation (NLG) applications.
Summary
Due to the sequential nature of text generation, autoregressive models are often the tool of choice for tackling sequence-to-sequence problems such as translation (Sutskever et al., 2014), summarization (Rush et al., 2015), and dialogue (Eric and Manning, 2017). They form the backbone of several successful generative pre-training architectures (Howard and Ruder, 2018; Peters et al., 2018; Radford et al., 2019; Dai et al., 2019). The joint distribution over the output sequence y may itself be conditional on some related source feature x ∈ X (e.g. translation, summarization) or not (e.g. language modeling). Since the former case can generalize the latter by letting X = ∅, we will specify the presence of x in the rest of the paper. The training objective can then be expressed as the standard maximum-likelihood (negative log-likelihood) objective, sketched below.
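Assuming the usual factorized formulation of an autoregressive model, the maximum-likelihood objective over a training set D of (x, y) pairs can be written as follows; the notation here is assumed for illustration rather than quoted from the paper:

```latex
\mathcal{L}_{\mathrm{MLE}}(\theta)
  = - \sum_{(x,\, y) \in \mathcal{D}} \log p_\theta(y \mid x)
  = - \sum_{(x,\, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```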