Abstract

Neural dialogue models, despite their successes, still suffer from a lack of relevance, diversity, and, in many cases, coherence in their generated responses. These issues can be attributed to several factors, including (1) short-range model architectures that capture limited temporal dependencies, (2) limitations of the maximum likelihood training objective, (3) the concave entropy profile of dialogue datasets, which leads to short and generic responses, and (4) the out-of-vocabulary problem, which leads to the generation of a large number of <UNK> tokens. Transformer-based models such as GPT-2, on the other hand, have demonstrated an excellent ability to capture long-range structure in language modeling tasks. In this paper, we present DLGNet, a transformer-based model for dialogue modeling. We specifically examine the use of DLGNet for multi-turn dialogue response generation. In our experiments, we evaluate DLGNet on the open-domain Movie Triples dataset and the closed-domain Ubuntu Dialogue dataset. DLGNet models, although trained with only the maximum likelihood objective, achieve significant improvements over state-of-the-art multi-turn dialogue models. They also produce the best performance to date on the two datasets across several metrics, including BLEU, ROUGE, and distinct n-gram scores. Our analysis shows that the performance improvement is mostly due to the combination of (1) the long-range transformer architecture and (2) the injection of random informative paddings. Other contributing factors include the joint modeling of dialogue context and response, and the 100% tokenization coverage provided by byte pair encoding (BPE).
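
To make the modeling setup concrete, the following is a minimal sketch (not the authors' released code) of joint context–response modeling with a GPT-2-style transformer, BPE tokenization, and a maximum-likelihood objective, assuming the Hugging Face transformers library. Using the end-of-text token as a turn separator is an assumption for illustration only; the paper's exact input formatting and its random informative paddings are not reproduced here.

```python
# Sketch only: joint modeling of dialogue context and response with a GPT-2-style
# transformer, BPE tokenization, and a maximum-likelihood (next-token) objective.
# The turn separator below is an assumption, not the paper's exact scheme.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# A multi-turn dialogue: context turns followed by the target response.
context_turns = ["hi , how are you ?", "i am fine , thanks . and you ?"]
response = "doing well . any plans for the weekend ?"

# Joint sequence: context and response concatenated into a single token stream,
# so a single language-modeling loss covers both (no separate encoder/decoder).
sep = tokenizer.eos_token  # stand-in turn separator
text = sep.join(context_turns + [response]) + sep
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Maximum-likelihood objective: next-token cross-entropy over the whole sequence.
outputs = model(input_ids, labels=input_ids)
outputs.loss.backward()
print(f"LM loss: {outputs.loss.item():.3f}")
```

Because the BPE vocabulary covers arbitrary byte sequences, rare words in either context or response are split into subword units rather than being replaced by an <UNK> token.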

Highlights

  • Recent successes of pretrained transformer-based language models, such as BERT (Devlin et al., 2019), GPT(-2) (Radford et al., 2018; Radford et al., 2019), Transformer-XL (Dai et al., 2019), XLNet (Yang et al., 2019), and ERNIE(2.0) (Sun et al., 2019a,b), have led to state-of-the-art performance on many natural language understanding (NLU) tasks, including sentence classification, named entity recognition, sentence similarity, and question answering

  • The transformer-based DLGNet provides a significant improvement in response generation performance over existing methods such as (V)HRED, hredGAN, DAIM, and adversarial bootstrapping, all of which are based on recurrent neural networks

  • DLGNet achieves the best performance to date on the Movie Triples and Ubuntu Dialogue datasets in terms of BLEU, ROUGE, and distinct n-gram scores


Summary

Introduction

Recent successes of pretrained transformer-based language models, such as BERT (Devlin et al., 2019), GPT(-2) (Radford et al., 2018; Radford et al., 2019), Transformer-XL (Dai et al., 2019), XLNet (Yang et al., 2019), and ERNIE(2.0) (Sun et al., 2019a,b), have led to state-of-the-art performance on many natural language understanding (NLU) tasks, including sentence classification, named entity recognition, sentence similarity, and question answering. The exceptional performance of transformer-based language models is due to their ability to capture long-term temporal dependencies in the input sequence. This attribute should be very beneficial to dialogue modeling, especially in multi-turn scenarios. Most existing neural dialogue response generation models, however, are based on recurrent neural networks (Sutskever et al., 2014; Vinyals and Le, 2015; Li et al., 2016a; Serban et al., 2016; Xing et al., 2017; Serban et al., 2017b,a; Li et al., 2016b; Zhang et al., 2018a; Olabiyi et al., 2018, 2019a) and still suffer from a lack of relevance, diversity, and coherence in their generated responses. Previous work points to several causes of these limitations:

i) Training data: The presence of high-frequency generic utterances (utterance-level semantic redundancy), such as “I don’t know” and “I’m not sure”, and of high-frequency generic n-gram tokens (word-level syntactic redundancy), such as “I” and “I am”, leads to the concave positional entropy profile of dialogue datasets (see Fig. 1; a toy sketch of this profile follows the list), which makes learning difficult and results in short and generic responses.

ii) Short-range model architecture: Short-range model architectures capture only limited temporal dependencies.

iii) Out-of-vocabulary problem: Less frequent (and usually more informative) words are mapped to the out-of-vocabulary token <UNK>, leading to the generation of a large number of <UNK> tokens.

iv) Exposure bias: The discrepancy in model behavior between training and inference limits the informativeness of the responses.

v) Training objective: The maximum likelihood training objective has inherent limitations.
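
As a rough illustration of the positional entropy profile mentioned in (i), the toy sketch below estimates the entropy of the token distribution at each utterance position over a small invented corpus. The utterances are made up for illustration; this is not the paper's measurement or the data behind Fig. 1.

```python
# Toy sketch: per-position token entropy of a (tiny, invented) dialogue corpus,
# the quantity behind the positional entropy profile discussed above.
import math
from collections import Counter

utterances = [
    "i don't know .",
    "i am not sure .",
    "i think the movie starts at eight .",
    "i am fine , thanks .",
]

def positional_entropy(corpus, position):
    """Shannon entropy (in bits) of the token distribution at a given position."""
    tokens = [u.split()[position] for u in corpus if len(u.split()) > position]
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

for pos in range(5):
    print(f"position {pos}: entropy = {positional_entropy(utterances, pos):.2f} bits")
```

Even on this toy corpus, the entropy is near zero at the first position (almost every utterance starts with “I”), reflecting the word-level redundancy that makes generic openings easy for a model to learn.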

