Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations.

Xianli Sheng

doi:10.1371/journal.pone.0290315

Abstract

Existing visual question answering methods typically focus only on the visual targets in an image and ignore its key textual content, which limits the depth and accuracy of image comprehension. Motivated by this limitation, we focus on the task of text-based visual question answering, address the performance bottleneck caused by the over-fitting risk in existing self-attention-based models, and propose a scene-text visual question answering method called INT2-VQA that fuses knowledge representations based on inter-modality and intra-modality collaborations. Specifically, we model two kinds of complementary prior knowledge: the locational collaboration between visual targets and textual targets across modalities, and the contextual semantic collaboration among textual word targets within a modality. Based on this, a universal knowledge-reinforced attention module is designed to encode both relations in a unified representation. Extensive ablation experiments, comparative experiments, and visual analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
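
The abstract gives no equations for the knowledge-reinforced attention module, so the following is only an illustrative sketch of one common way to inject prior relations into attention: adding a relation score to the attention logits before the softmax. It is not the INT2-VQA implementation. The function names (`location_prior`, `knowledge_biased_attention`), the centre-distance prior, and the `strength` parameter are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def location_prior(box_a, box_b, strength=4.0):
    """Hypothetical locational prior: closer box centres get a higher score.

    Boxes are (x1, y1, x2, y2) in normalised image coordinates.
    This choice of prior is an assumption, not taken from the paper.
    """
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return -strength * math.hypot(ax - bx, ay - by)


def knowledge_biased_attention(queries, keys, values, prior_bias):
    """Scaled dot-product attention with an additive prior on the logits.

    prior_bias[i][j] is added to the logit between query i and key j,
    so related pairs receive more attention weight.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for i, q in enumerate(queries):
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) + prior_bias[i][j]
            for j, k in enumerate(keys)
        ]
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of the value vectors, one output channel at a time.
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values)))
             for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy usage: three targets (visual objects or OCR tokens) with 2-d features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
boxes = [(0.0, 0.0, 0.2, 0.2), (0.1, 0.1, 0.3, 0.3), (0.8, 0.8, 1.0, 1.0)]
prior = [[location_prior(a, b) for b in boxes] for a in boxes]
outputs, weights = knowledge_biased_attention(feats, feats, feats, prior)
```

In this toy setup the third box lies far from the first, so the prior lowers the attention the first target pays to it compared with plain self-attention. A semantic prior between OCR words could be supplied through the same `prior_bias` matrix, which is one way a single module could handle both kinds of relation.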

Journal: PloS one
Publication Date: Aug 30, 2023
Citations: 1
License type: CC BY 4.0

Similar Papers

Visual Question Generation as Dual Task of Visual Question Answering
Yikang Li ... Xiaogang Wang
01 Jun 2018

Improving Automatic VQA Evaluation Using Large Language Models
Oscar Mañas ... Aishwarya Agrawal
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024

Incorporating 3D Information Into Visual Question Answering
Yue Qiu ... Ryota Suzuki
01 Sep 2019

VQA: Visual Question Answering
Aishwarya Agrawal ... C Lawrence Zitnick
International Journal of Computer Vision | VOL. 123
08 Nov 2016
