Abstract

In the recent past, dialogue systems have gained immense popularity and have become ubiquitous. During conversations, humans rely not only on language but also seek contextual information through visual content. In every task-oriented dialogue system, the user is guided by the different aspects of a product or service, which steer the conversation towards selecting that product or service. In this work, we present a multi-modal conversational framework for a task-oriented dialogue setup that generates responses following the different aspects of a product or service to cater to the user's needs. We show that responses guided by aspect information are more interactive and informative, enabling better communication between the agent and the user. We first create a Multi-domain Multi-modal Dialogue (MDMMD) dataset comprising conversations with both text and images belonging to three different domains, namely restaurants, electronics, and furniture. We then implement a Graph Convolutional Network (GCN) based framework that generates appropriate textual responses from the multi-modal inputs. The multi-modal information, comprising both textual and image representations, is fed to the decoder along with the aspect information to generate aspect-guided responses. Quantitative and qualitative analyses show that the proposed methodology outperforms several baselines on the proposed task of aspect-guided response generation.

Highlights

  • Conversational systems have become ubiquitous in our everyday lives

  • To the best of our knowledge, this is the first attempt to incorporate aspect information into multi-modal dialogue systems

  • We create a Multi-domain Multi-modal Dialogue (MDMMD) dataset comprising conversations with both text and images belonging to three different domains, namely restaurant, electronics, and furniture

  • We propose a multi-modal graph convolutional framework for response generation that explicitly provides aspect information to the decoder to generate aspect-guided responses

  • Both automatic and human evaluation show the effectiveness of the proposed model over several baselines

  • As our research focuses on aspect-guided response generation in multi-modal dialogue systems, we observe that the frameworks incorporating aspect information outperform the other baseline models


Summary

Introduction

Conversational systems have become ubiquitous in our everyday lives. Previous research suggests that conversational agents need to be more interactive and informative in order to build engaging systems (Takayama and Arase, 2019; Shukla et al., 2019). This research indicates that engaging conversations include visual cues (e.g., a video or images) or audio cues (e.g., the tone or pitch of the speaker). Multi-modality in goal-oriented dialogue systems (Saha et al., 2018) for the fashion domain has established the significance of visual information for effective communication between the user and the agent. Inspired by these works, we take a step forward by creating a multi-modal, aspect-guided response framework for a multi-domain goal-oriented dialogue system.

The hidden representation of a 1-layer GCN is a matrix H ∈ ℝ^{i×p}, where each p-dimensional node representation captures the interaction of that node with its 1-hop neighbors. Multiple GCN layers can be stacked together to capture interactions with nodes that are several hops away. The representation of node v after the m-th GCN layer can then be formulated in terms of the previous-layer representations of its neighbors.
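A minimal sketch of this node update, assuming the standard GCN formulation (the exact variant used in the paper may differ):

h_v^{(m+1)} = \mathrm{ReLU}\Big( \sum_{u \in \mathcal{N}(v)} \big( W^{(m)} h_u^{(m)} + b^{(m)} \big) \Big)

where \mathcal{N}(v) denotes the 1-hop neighborhood of node v, and W^{(m)} and b^{(m)} are the weight matrix and bias of the m-th layer.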

