Abstract

Video-based visual question answering (V-VQA) remains a challenging task at the intersection of vision and language. In this paper, we propose a novel architecture, the Generalized Pyramid Co-attention with Learnable Aggregation Net (GPC), to address two central problems: 1) how to deploy co-attention for the V-VQA task given the complex and diverse content of videos; and 2) how to aggregate frame-level (or word-level) features without destroying the feature distributions and temporal information. To solve the first problem, we propose a Generalized Pyramid Co-attention structure with a novel diversity learning module that explicitly encourages attention accuracy and diversity. We first instantiate it as a Multi-path Pyramid Co-attention (MPC) module to capture diverse features. We then observe that each attention branch of the original co-attention mechanism does not interact with the others, which results in coarse attention maps, so we extend MPC to a Cascaded Pyramid Transformer Co-attention (CPTC) module that replaces co-attention with transformer co-attention. To solve the second problem, we propose a learnable aggregation method with a set of evidence gates. It automatically aggregates adaptively weighted frame-level (or word-level) features to extract rich video (or question) context semantics, and the evidence gates then select the signals most relevant to the evidence needed to predict the answer. Extensive experiments on two V-VQA datasets, TGIF-QA and TVQA, show that both MPC and CPTC achieve state-of-the-art performance, with CPTC performing better across various settings and metrics. Code and models have been released at: https://github.com/lixiangpengcs/LAD-Net-for-VideoQA.
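For intuition, the sketch below illustrates one way a learnable aggregation with an evidence gate could look in PyTorch: per-frame attention weights pool the frame features into a context vector, and a sigmoid gate then modulates which channels carry evidence. The module name, dimensions, and single-gate design are illustrative assumptions, not the authors' released GPC/LAD-Net implementation.

```python
# Minimal sketch of gated feature aggregation (hypothetical; not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAggregation(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # per-frame attention logit
        self.gate = nn.Linear(feat_dim, feat_dim)  # channel-wise evidence gate

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        weights = F.softmax(self.score(frames), dim=1)   # adaptive frame weights
        context = (weights * frames).sum(dim=1)          # weighted aggregation
        evidence = torch.sigmoid(self.gate(context))     # select relevant signals
        return evidence * context                        # gated video context

# Usage: pool 36 frame features of dimension 2048 into one gated context vector.
pooled = GatedAggregation(feat_dim=2048)(torch.randn(8, 36, 2048))
```

The same pattern applies to word-level features on the question side; the paper's full model additionally couples this with the pyramid co-attention branches.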
