Abstract

Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses. In this paper, we leverage the power of pre-trained language models to improve video-grounded dialogue, a challenging task that involves complex features of different dynamics: (1) video features, which extend across both spatial and temporal dimensions; and (2) dialogue features, which involve semantic dependencies over multiple dialogue turns. We propose a framework that extends GPT-2 models to tackle these challenges by formulating video-grounded dialogue as a sequence-to-sequence task, combining both visual and textual representations into a structured sequence, and fine-tuning a large pre-trained GPT-2 network. Our framework allows fine-tuning language models to capture dependencies across multiple modalities over different levels of information: the spatio-temporal level in video and the token-sentence level in dialogue context. We achieve promising improvements on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from DSTC7, which supports a potential direction in this line of research.
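
As a concrete illustration of the approach described above, the following is a minimal sketch (not the authors' released implementation) of how video features could be projected into GPT-2's embedding space, concatenated with the dialogue tokens into a single structured sequence, and fine-tuned with a language-modeling loss on the response. It uses the Hugging Face transformers library; names such as VideoTextGPT2 and video_proj, the feature dimension, and the dummy inputs are illustrative assumptions, and the full framework additionally encodes spatio-temporal and segment-level information that is omitted here.

```python
# Minimal sketch: fuse video features and dialogue tokens into one sequence for GPT-2.
# Assumed names (VideoTextGPT2, video_proj, video_feat_dim) are illustrative, not the paper's code.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class VideoTextGPT2(nn.Module):
    def __init__(self, model_name="gpt2", video_feat_dim=2048):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(model_name)
        hidden = self.gpt2.config.n_embd
        # Linear projection from (e.g. CNN-extracted) video features to GPT-2's embedding size.
        self.video_proj = nn.Linear(video_feat_dim, hidden)

    def forward(self, video_features, input_ids, labels):
        # video_features: (batch, num_frames, video_feat_dim)
        # input_ids:      (batch, text_len)  dialogue history + question + response tokens
        # labels:         (batch, num_frames + text_len) with -100 everywhere except response tokens
        video_embeds = self.video_proj(video_features)                 # (B, F, H)
        text_embeds = self.gpt2.transformer.wte(input_ids)             # (B, T, H)
        inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)  # one structured sequence
        out = self.gpt2(inputs_embeds=inputs_embeds, labels=labels)
        return out.loss


if __name__ == "__main__":
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = VideoTextGPT2()

    # Toy dialogue turn: question followed by the target response.
    q_ids = tokenizer("what is the man doing ? <|endoftext|> ", return_tensors="pt").input_ids
    r_ids = tokenizer("he is reading a book <|endoftext|>", return_tensors="pt").input_ids
    input_ids = torch.cat([q_ids, r_ids], dim=1)

    num_frames = 8
    video_features = torch.randn(1, num_frames, 2048)  # dummy per-frame features

    # Compute the LM loss only on the response tokens; mask video and question positions.
    labels = torch.full((1, num_frames + input_ids.size(1)), -100, dtype=torch.long)
    labels[:, -r_ids.size(1):] = r_ids

    loss = model(video_features, input_ids, labels)
    loss.backward()  # gradients for one fine-tuning step
```

Passing the projected video features through inputs_embeds leaves the pre-trained GPT-2 weights untouched at initialization; only the small projection layer is trained from scratch alongside the fine-tuned language model.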

Highlights

  • Recent work in large-scale pre-training of transformer-based neural networks (Liu et al., 2019; Devlin et al., 2019; Radford et al., 2019) has boosted performance in various NLP tasks

  • Similar to pre-trained CNN-based neural networks developed in computer vision research (He et al., 2016; Huang et al., 2017), which can learn high-resolution features in images, pre-trained language models (LMs) are capable of capturing fine-grained textual dependencies in text data of rich semantics

  • While the benefits of pre-trained language models are present in many downstream NLP tasks such as machine translation and question answering (QA) (Devlin et al., 2019; Lan et al., 2020), they are well suited for adaptation to dialogue response generation tasks for two major reasons: (1) dialogue response generation usually involves more complex dynamics between input and output text sequences

Summary

Introduction

Recent work in large-scale pre-training of transformer-based neural networks (Liu et al., 2019; Devlin et al., 2019; Radford et al., 2019) has boosted performance in various NLP tasks. Similar to pre-trained CNN-based neural networks developed in computer vision research (He et al., 2016; Huang et al., 2017), which can learn high-resolution features in images, pre-trained language models (LMs) are capable of capturing fine-grained textual dependencies in text data of rich semantics. While the benefits of pre-trained language models are present in many downstream NLP tasks such as machine translation and question answering (QA) (Devlin et al., 2019; Lan et al., 2020), they are well suited for adaptation to dialogue response generation tasks for two major reasons: (1) dialogue response generation usually involves more complex dynamics between input and output text sequences. We are motivated by these observations to adapt pre-trained language models to a dialogue task and improve the quality of generated responses.
