Abstract

Video question answering (VideoQA) has attracted much interest from scholars as one of the most representative multimodal tasks in recent years. The task requires the model to jointly interact with and reason over the video and the question. Most existing approaches use pre-trained networks to extract complex embeddings of videos and questions independently before performing multimodal fusion. However, they overlook two factors: (1) These feature extractors are pre-trained for image or video classification without taking the question into consideration, and therefore may not be well suited to the VideoQA task. (2) Using multiple feature extractors to extract features at different levels introduces more irrelevant information, making the task more difficult. For these reasons, we propose a new model named Spatio-Temporal Two-Stage Fusion, which ties together multiple levels of feature extraction and divides the process into two distinct stages: spatial fusion and temporal fusion. Specifically, in the spatial fusion stage, we use a Vision Transformer to integrate intra-frame information and generate frame-level features. At the same time, we design a multimodal temporal fusion module that fuses textual information into the video representation and assigns different levels of attention to each frame. The resulting frame-level features are then passed to another Vision Transformer to generate global video features. To efficiently model cross-modal interaction, we design a video–text symmetric fusion module that retains the most relevant information through mutual guidance between the two modalities. Our method is evaluated on three benchmark datasets (MSVD-QA, MSRVTT-QA, and TGIF-QA) and achieves competitive results.
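The two-stage pipeline described above can be summarized in a short sketch. The PyTorch code below is a minimal illustration, not the paper's implementation: the module names, dimensions, mean pooling, and the use of a single cross-attention layer to stand in for the multimodal temporal fusion module are all assumptions made for clarity, and the video–text symmetric fusion module is omitted.

```python
import torch
import torch.nn as nn


class TwoStageFusion(nn.Module):
    """Stage 1: spatial fusion over patches within each frame.
    Stage 2: temporal fusion over the resulting frame-level features,
    with question features injected between the two stages."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()

        def make_encoder():
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            return nn.TransformerEncoder(layer, layers)

        self.spatial_encoder = make_encoder()   # intra-frame (patch-level) fusion
        self.temporal_encoder = make_encoder()  # inter-frame (frame-level) fusion
        # Cross-attention standing in for the multimodal temporal fusion module
        self.frame_question_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_feats, question_feats):
        # patch_feats:    (B, T, P, D) patch embeddings for T frames per video
        # question_feats: (B, L, D)    token embeddings of the question
        B, T, P, D = patch_feats.shape

        # Stage 1: spatial fusion -> one feature vector per frame (mean-pooled)
        frames = self.spatial_encoder(patch_feats.reshape(B * T, P, D))
        frames = frames.mean(dim=1).reshape(B, T, D)

        # Question-guided weighting of frames: frames attend to question tokens
        attended, _ = self.frame_question_attn(frames, question_feats, question_feats)
        frames = frames + attended

        # Stage 2: temporal fusion -> global video representation
        return self.temporal_encoder(frames).mean(dim=1)


# Toy example: 2 videos, 16 frames, 49 patches per frame, 512-d features
model = TwoStageFusion()
video_repr = model(torch.randn(2, 16, 49, 512), torch.randn(2, 12, 512))
print(video_repr.shape)  # torch.Size([2, 512])
```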
