Abstract
Video question generation (VQG) is a challenging task in visual information retrieval: given a sequence of video frames, the goal is to generate questions about them. Existing methods mainly tackle single-turn video question generation, but a single-turn conversation usually cannot meet the needs of video information acquisition. In this paper, we propose a new framework for single-turn VQG, which introduces an attention mechanism to perform inference over the dialog history. We also introduce a selection mechanism to choose among the candidate questions generated from each round of the dialog history. Within the framework, we leverage a recent video question answering model to predict the answer to each generated question and use the answer quality as a reward to fine-tune our model through reinforcement learning. In addition, we introduce a new task, multi-turn video question generation (M-VQG), which generates multiple questions based on the dialog history and video information to build a conversation step by step. Our method achieves state-of-the-art performance on the single-turn VQG task on two large-scale datasets, YouTube-Clips and TACoS-MultiLevel, and provides a baseline approach for the M-VQG task.