Abstract

Medical Visual Question Answering (VQA) aims to answer questions about medical images, drawing on both visual and textual information during reasoning. The absence of large-scale annotated medical VQA datasets presents a formidable obstacle to training a medical VQA model from scratch in an end-to-end manner. Existing works use image captioning datasets in the pre-training stage and then fine-tune on downstream VQA tasks. Following the same paradigm, we pre-train multimodal models in a self-supervised setup on a collection of public medical image captioning datasets and fine-tune them on downstream medical VQA tasks. In this work, we propose a method featuring Cross-Modal pre-training with Multiple Objectives (CMMO), which combines masked image modelling, masked language modelling, image-text matching, and image-text contrastive learning. The proposed method associates the visual features of medical images with the corresponding medical concepts in captions, learning aligned vision and language feature representations and multi-modal interactions. Experimental results show that our proposed CMMO method outperforms state-of-the-art methods on three public medical VQA datasets, with absolute improvements of 2.6%, 0.9%, and 4.0% on the VQA-RAD, PathVQA, and SLAKE datasets, respectively. We also conduct comprehensive ablation studies to validate our method and visualize attention maps, which demonstrate strong interpretability. The code and pre-trained weights will be released at https://github.com/pengfeiliHEU/CMMO.
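
To make the multi-objective pre-training concrete, the following is a minimal PyTorch sketch of how the four CMMO objectives could be combined into one training loss. It is not the authors' released code: the function signature, the use of an InfoNCE-style contrastive loss, the L2 reconstruction target for masked image modelling, and the equal loss weights are all assumptions for illustration.

```python
# Hypothetical sketch of combining the four CMMO pre-training objectives
# (ITC, ITM, MLM, MIM) into a single weighted loss. Not the official code.
import torch
import torch.nn.functional as F

def cmmo_loss(img_emb, txt_emb,            # pooled image/text embeddings, (B, D)
              itm_logits, itm_labels,      # match/no-match logits and 0/1 labels
              mlm_logits, mlm_labels,      # (B, T, V) token logits; -100 = unmasked
              mim_pred, mim_target,        # masked-patch predictions and targets
              temperature=0.07,
              lambdas=(1.0, 1.0, 1.0, 1.0)):  # equal weights are an assumption
    # Image-text contrastive (ITC): symmetric InfoNCE over in-batch pairs.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_itc = (F.cross_entropy(sim, targets) +
                F.cross_entropy(sim.t(), targets)) / 2

    # Image-text matching (ITM): binary classification on fused features.
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    # Masked language modelling (MLM): predict masked caption tokens;
    # ignore_index=-100 skips positions that were not masked.
    loss_mlm = F.cross_entropy(mlm_logits.flatten(0, 1),
                               mlm_labels.flatten(), ignore_index=-100)

    # Masked image modelling (MIM): reconstruct masked patches. An L2 loss on
    # the prediction is assumed here; the paper's exact target may differ.
    loss_mim = F.mse_loss(mim_pred, mim_target)

    l_itc, l_itm, l_mlm, l_mim = lambdas
    return (l_itc * loss_itc + l_itm * loss_itm +
            l_mlm * loss_mlm + l_mim * loss_mim)
```

In such a setup, ITC and ITM push the model toward aligned image-caption representations, while MLM and MIM force it to ground masked words and image patches in the other modality, which is consistent with the paper's stated goal of associating visual features with medical concepts in captions.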
