Abstract

We propose an open-ended multimodal video question answering (VideoQA) method that predicts textual answers by referring to multimodal information derived from videos. Most current open-ended VideoQA methods focus on motion and appearance features from videos and ignore the audio features that are useful for understanding video content in more detail. The few prior works that use motion, appearance, and audio features showed poor results on public benchmarks because they failed to effectively fuse detailed (e.g., region- or grid-level) multimodal features for video reasoning. We overcome these limitations with multi-stream 3-dimensional convolutional networks (3D ConvNets) and a transformer-based modulator for VideoQA. Our network represents detailed motion and appearance features, as well as audio features, with multiple 3D ConvNets and modulates each intermediate representation with question information to extract question-relevant spatiotemporal features over the frames. Based on the question content, our network fuses the multimodal information from the 3D ConvNets and predicts the final answers. Our VideoQA method, which effectively combines multimodal information, outperformed both a previous multimodal VideoQA method and a state-of-the-art method on standard benchmarks. Visualization suggests that our method can predict the correct answers by listening to the audio information, even when the motion and appearance features are inadequate for understanding the video content.
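
As a rough illustration of this design, the sketch below shows question-modulated multi-stream 3D ConvNets in PyTorch. It is a minimal sketch under assumptions: the module names, dimensions, the cross-attention modulator, and the answer-vocabulary classifier are illustrative choices, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code) of a question-modulated
# multi-stream 3D ConvNet for open-ended VideoQA.
import torch
import torch.nn as nn


class Stream3D(nn.Module):
    """One modality stream (motion, appearance, or audio) built from 3D convolutions."""
    def __init__(self, in_channels, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, dim, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                       # x: (B, C, T, H, W)
        feat = self.backbone(x)                 # (B, dim, T', H', W')
        return feat.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim) spatiotemporal tokens


class QuestionModulator(nn.Module):
    """Cross-attention: question tokens attend over a stream's spatiotemporal tokens."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_tokens, stream_tokens):
        attended, _ = self.attn(q_tokens, stream_tokens, stream_tokens)
        return self.norm(attended + q_tokens)   # question-conditioned stream summary


class MultiStreamVideoQA(nn.Module):
    def __init__(self, vocab_size, answer_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.q_enc = nn.GRU(dim, dim, batch_first=True)
        self.motion = Stream3D(3, dim)       # e.g., optical-flow clip (assumed input)
        self.appearance = Stream3D(3, dim)   # RGB frames
        self.audio = Stream3D(1, dim)        # e.g., log-mel spectrogram treated as 3D input
        self.modulators = nn.ModuleList(QuestionModulator(dim) for _ in range(3))
        self.fuse = nn.Linear(3 * dim, dim)
        self.classifier = nn.Linear(dim, answer_size)  # open-ended answers as a vocabulary

    def forward(self, question_ids, motion, appearance, audio):
        q_tokens, _ = self.q_enc(self.embed(question_ids))            # (B, L, dim)
        streams = [self.motion(motion), self.appearance(appearance), self.audio(audio)]
        # Modulate each stream with the question, then pool over question tokens.
        pooled = [mod(q_tokens, s).mean(dim=1) for mod, s in zip(self.modulators, streams)]
        fused = torch.relu(self.fuse(torch.cat(pooled, dim=-1)))      # question-guided fusion
        return self.classifier(fused)                                 # answer logits


if __name__ == "__main__":
    model = MultiStreamVideoQA(vocab_size=10000, answer_size=1000)
    logits = model(
        torch.randint(0, 10000, (2, 12)),    # tokenized question
        torch.randn(2, 3, 8, 64, 64),        # motion clip
        torch.randn(2, 3, 8, 64, 64),        # appearance clip
        torch.randn(2, 1, 8, 64, 64),        # audio spectrogram clip
    )
    print(logits.shape)                      # torch.Size([2, 1000])
```

In this sketch, each modality keeps its own 3D ConvNet so fine-grained spatiotemporal tokens survive until the question-conditioned modulation step, and fusion happens only after each stream has been summarized with respect to the question.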
