Abstract

Visual dialogue systems need to understand dynamic visual scenes and comprehend their semantics in order to converse with users. Constructing video dialogue systems is more challenging than building traditional image dialogue systems because the large feature space of videos makes it difficult to capture semantic information. Furthermore, the dialogue system must answer users' questions precisely based on a comprehensive understanding of the video and the previous dialogue. To improve the performance of video dialogue systems, we propose an end-to-end recurrent cross-modality attention (ReCMA) model that answers a series of questions about a video using both the visual and textual modalities. At each step of the reasoning process, the answer representation of the question is updated from both the visual and textual representations, yielding a better understanding of the information in both modalities. We evaluate our method on the challenging DSTC7 video scene-aware dialog dataset, where the proposed ReCMA achieves a relative 20.8% improvement over the baseline on CIDEr.
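
The abstract only outlines the reasoning loop, so the snippet below is a minimal illustrative sketch of one plausible recurrent cross-modality attention step: an answer state queries the visual features and the textual (question plus dialogue history) features via attention, and the attended contexts drive a recurrent update of the answer state. The single-head attention, GRU-cell update, feature dimensions, and variable names are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn


class RecurrentCrossModalityStep(nn.Module):
    """One illustrative reasoning step: attend over visual and textual
    features with the current answer state as the query, then update the
    answer state from the attended contexts. Layer choices are assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.update = nn.GRUCell(2 * dim, dim)  # recurrent update of the answer state

    def forward(self, answer_state, visual_feats, text_feats):
        # answer_state: (batch, dim); visual_feats, text_feats: (batch, seq, dim)
        query = answer_state.unsqueeze(1)
        v_ctx, _ = self.visual_attn(query, visual_feats, visual_feats)
        t_ctx, _ = self.text_attn(query, text_feats, text_feats)
        context = torch.cat([v_ctx.squeeze(1), t_ctx.squeeze(1)], dim=-1)
        return self.update(context, answer_state)  # refined answer representation


# The answer representation is refined over a few recurrent reasoning steps.
step = RecurrentCrossModalityStep(dim=512)
answer = torch.zeros(2, 512)               # initial answer state
video = torch.randn(2, 40, 512)            # per-frame video features (assumed shape)
dialog = torch.randn(2, 30, 512)           # question + dialogue history features (assumed shape)
for _ in range(3):
    answer = step(answer, video, dialog)
```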
