Abstract

Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common shortcoming of existing VQA approaches is that they consider only a very limited number of inter-modality interactions, which may not be enough to model the latent, complex image-question relations necessary for accurately answering questions. Moreover, most methods neglect the modeling of intra-modality interactions, which are also important to VQA. In this work, we propose a novel DenIII framework for modeling dense inter- and intra-modality interactions. It densely connects all pairwise layers of the network via the proposed Inter- and Intra-modality Attention Connectors, capturing fine-grained interplay across all hierarchical levels. The Inter-modality Attention Connector efficiently connects multi-modality features at any two layers with bidirectional attention, capturing inter-modality interactions, while the Intra-modality Attention Connector connects features of the same modality with unidirectional attention, modeling intra-modality interactions. Extensive ablation studies and visualizations validate the effectiveness of our method, and DenIII achieves state-of-the-art or competitive performance on three publicly available datasets.
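
The following is a minimal sketch, not the authors' released implementation: it illustrates how the two connectors described above could be realized with standard scaled dot-product attention in PyTorch. The class names, tensor shapes, head count, and residual connections are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class InterModalityAttentionConnector(nn.Module):
    # Bidirectional attention between image and question features: each
    # modality at one layer attends to the other modality at another layer.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.img_attends_q = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img, q):
        # img: (batch, num_regions, dim); q: (batch, num_tokens, dim)
        img_ctx, _ = self.img_attends_q(query=img, key=q, value=q)
        q_ctx, _ = self.q_attends_img(query=q, key=img, value=img)
        # Residual connections preserve the original features (an assumption).
        return img + img_ctx, q + q_ctx

class IntraModalityAttentionConnector(nn.Module):
    # Unidirectional attention within one modality: features at one layer
    # attend to features of the same modality at another layer.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lower, higher):
        ctx, _ = self.attn(query=higher, key=lower, value=lower)
        return higher + ctx

# Dense connectivity would apply these connectors between every pair of
# layers (i, j) with i < j; below, a single inter-modality connection.
x_img, x_q = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
inter = InterModalityAttentionConnector(512)
img_out, q_out = inter(x_img, x_q)
```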
