VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering

Aiman Lameesa, Chaklam Silpasuwanchai, Md Sakib Bin Alam

doi:10.1016/j.neucom.2024.128730

Abstract

Image and question matching is essential in Medical Visual Question Answering (MVQA) for accurately assessing the visual-semantic correspondence between an image and a question. However, recent state-of-the-art methods focus solely on contrastive learning between an entire image and a question. Although contrastive learning successfully models the global relationship between an image and a question, it is less effective at capturing the fine-grained alignments between image regions and question words. Moreover, large-scale pre-training has significant drawbacks, including extended training times, the need to handle substantial data volumes, and high computational requirements. To address these challenges, we propose the Vision-Guided Cross-Attention based Late Fusion (VG-CALF) network, which integrates image and question features into a unified deep model without relying on pre-training for MVQA tasks. In our proposed approach, we use self-attention to leverage intra-modal relationships within each modality and implement vision-guided cross-attention to emphasize the inter-modal relationships between image regions and question words. By simultaneously considering intra-modal and inter-modal relationships, our proposed method significantly improves the overall performance of MVQA without the need for pre-training on extensive image-question pairs. Experimental results on benchmark datasets such as SLAKE and VQA-RAD demonstrate that our proposed approach performs competitively with existing state-of-the-art methods.
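
The combination of intra-modal self-attention, inter-modal cross-attention, and late fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature sizes, the random inputs, the absence of learned projection matrices, the choice to let question words act as queries over image regions, and the mean-pool-and-concatenate fusion are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (attended output, weights)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))  # hypothetical: 5 image regions, 8-dim features
words = rng.normal(size=(4, 8))    # hypothetical: 4 question words, 8-dim features

# Intra-modal step: each modality attends to itself.
regions_sa, _ = attention(regions, regions, regions)
words_sa, _ = attention(words, words, words)

# Inter-modal step (assumed direction): each question word attends
# over the image regions, so visual features guide the word features.
words_ca, cross_weights = attention(words_sa, regions_sa, regions_sa)

# Late fusion (assumed form): pool each stream, then concatenate.
fused = np.concatenate([regions_sa.mean(axis=0), words_ca.mean(axis=0)])
```

Here `cross_weights[i, j]` is the weight that question word `i` places on image region `j`, which is the kind of fine-grained region-word alignment that whole-image contrastive objectives do not expose. A real model would add learned query, key, and value projections and a classifier over `fused`.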

More From: Neurocomputing

Similar Papers

AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering
Haiwei Pan ... Kun Shi
Knowledge-Based Systems | VOL. 255
27 Aug 2022

A medical visual question answering approach based on co-attention networks
Wencheng Cui ... Hong Shao
Sheng wu yi xue gong cheng xue za zhi = Journal of biomedical engineering = Shengwu yixue gongchengxue zazhi | VOL. 41
25 Jun 2024

Multi-Modality Cross Attention Network for Image and Sentence Matching
Xi Wei ... Yan Li
01 Jun 2020

Medical visual question answering via corresponding feature fusion combined with semantic attention.
Han Zhu ... Linbo Qing
Mathematical biosciences and engineering : MBE | VOL. 19
01 Jan 2021
