Multi-Speaker Video Dialog with Frame-Level Temporal Localization

Qiang Wang, Zhiyi Guo, Zhou Zhao, Yahong Han, Pin Jiang

doi:10.1609/aaai.v34i07.6901

Abstract

To simulate human interaction in real life, dialog systems are designed to generate a response to previous chat utterances. Several studies have addressed two-speaker video dialog in the form of question answering. However, more informative semantic cues might be exploited when multiple speakers chat about or discuss a video over multiple rounds, so multi-speaker video dialog is more applicable in real life. In addition, speakers typically chat about one sub-segment of a long video fragment for a period of time. Current video dialog systems must be given the relevant video sub-segment directly, yet accurately spotting that sub-segment is hard in practical applications. In this paper, we introduce a novel task, Multi-Speaker Video Dialog with frame-level Temporal Localization (MSVD-TL), to make video dialog systems more applicable. Given a long video fragment and a set of chat history utterances, MSVD-TL aims to simultaneously predict the following response and localize the relevant video sub-segment at the frame level. We develop a new multi-task model with a response prediction module and a frame-level temporal localization module. We also focus on the characteristics of the video dialog generation process and exploit the relations among the video fragment, the chat history, and the following response to refine their representations. We evaluate our approach on both the Multi-Speaker Video Dialog without frame-level temporal localization (MSVD w/o TL) task and the MSVD-TL task. The experimental results further demonstrate that MSVD-TL enhances the applicability of video dialog in real life.

Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence
Publication Date: Apr 3, 2020
Citations: 2
