Visual question answering (VQA), which aims to correctly answer natural-language questions about images or videos, has become an active research topic in recent years. However, existing VQA models are mostly designed for image-based questions and perform poorly in the video question answering (VideoQA) domain, where a model must simultaneously consider the correlations between video frames and the dynamic information of multiple objects in a video. Therefore, we propose Cascade Transformers with Dynamic Attention for Video Question Answering (CTDA-QA), a novel model that addresses both requirements at once. Specifically, CTDA-QA employs a cascade of transformers to encode videos and reason about complex spatial and temporal information, in contrast to previous recurrent-neural-network-based methods. In addition, to effectively capture the dynamic information across the varied scenes in a video, a flexible attention module is proposed to explore the essential relations between objects along a dynamic timeline. Finally, to avoid spurious answers and fully exploit cross-modal relationships, a mixed-supervised learning strategy is designed to optimize the reasoning tasks. Experiments on several benchmark VideoQA datasets, including comparisons with state-of-the-art methods, clearly verify the performance and effectiveness of CTDA-QA, and the accompanying ablation study and visualization results further reveal its potential.
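To make the described pipeline concrete, the sketch below illustrates one plausible reading of a cascaded transformer video encoder combined with a question-guided dynamic attention module. It is a minimal PyTorch-style illustration only: all module names, feature dimensions, the two-stage (frame-level then clip-level) cascade, and the pooling choices are assumptions for exposition, not the authors' released implementation, and the mixed-supervised training objective is omitted.

```python
# Hypothetical sketch of a cascade-transformer VideoQA model with a
# question-guided dynamic attention module. Shapes, layer counts, and
# the fusion scheme are illustrative assumptions, not CTDA-QA itself.
import torch
import torch.nn as nn

class DynamicAttention(nn.Module):
    """Attends over object features along the timeline, conditioned on the question."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, objects: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # objects: (B, T*O, D) object features across time; question: (B, 1, D) pooled query
        ctx, _ = self.attn(question, objects, objects)
        return ctx.squeeze(1)  # (B, D) question-guided object context

class CascadeTransformerQA(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, num_answers: int = 1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.frame_encoder = nn.TransformerEncoder(layer, num_layers=2)  # spatial stage
        self.clip_encoder = nn.TransformerEncoder(layer, num_layers=2)   # temporal stage
        self.dyn_attn = DynamicAttention(dim, heads)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, frames, objects, question):
        # frames: (B, T, D) frame features; objects: (B, T*O, D); question: (B, L, D) token embeddings
        spatial = self.frame_encoder(frames)       # relations within/between frame features
        temporal = self.clip_encoder(spatial)      # cascaded pass for cross-frame dynamics
        q = question.mean(dim=1, keepdim=True)     # (B, 1, D) pooled question embedding
        obj_ctx = self.dyn_attn(objects, q)        # (B, D) dynamic object context
        vid_ctx = temporal.mean(dim=1)             # (B, D) global video context
        return self.classifier(torch.cat([vid_ctx, obj_ctx], dim=-1))

# Toy forward pass with random features (2 clips, 16 frames, 5 objects per frame).
model = CascadeTransformerQA()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 16 * 5, 256), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 1000])
```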