Parallel multi-head attention and term-weighted question embedding for medical visual question answering.

Sruthy Manmadhan, Binsu C Kovoor

doi:10.1007/s11042-023-14981-2

Abstract

The goal of medical visual question answering (Med-VQA) is to correctly answer a clinical question posed about a medical image. Medical images are fundamentally different from images in the general domain. As a result, general-domain Visual Question Answering (VQA) models cannot be applied directly to the medical domain. Furthermore, the large-scale data required by VQA models is rarely available in the medical arena. Existing approaches to medical visual question answering often rely on transfer learning with external data to generate good image feature representations, and use cross-modal fusion of visual and language features to compensate for the lack of labelled data. This work proposes a new parallel multi-head attention framework (MaMVQA) for Med-VQA that does not use external data. The proposed framework performs image feature extraction with an unsupervised Denoising Auto-Encoder (DAE) and language feature extraction with term-weighted question embedding. In addition, we present qf-MI, a new supervised term-weighting (STW) scheme based on the mutual information (MI) between a word and the corresponding class label. Extensive experiments on the public VQA-RAD medical VQA benchmark show that the proposed method outperforms previous state-of-the-art methods in accuracy while requiring no external data to train the model. Remarkably, the MaMVQA model achieves significantly higher accuracy in predicting answers to both close-ended (78.68%) and open-ended (55.31%) questions. An extensive set of ablation studies is also conducted to demonstrate the significance of the individual components of the system.
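
To illustrate the idea of weighting question words by their mutual information with a class label, the following is a minimal Python sketch. It is not the qf-MI formula itself, which the abstract does not specify; it assumes the standard mutual information between a word's presence or absence and the label, estimated from question counts. The function names, the whitespace tokenisation, and the toy word vectors are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mi_term_weights(questions, labels):
    """Weight each word by the mutual information (in bits) between its
    presence/absence in a question and the question's class label.
    Assumption: a generic MI estimate, not the paper's qf-MI definition."""
    n = len(questions)
    label_counts = Counter(labels)
    joint = Counter()        # (word, label) -> questions with label containing word
    word_counts = Counter()  # word -> questions containing word
    for question, label in zip(questions, labels):
        for word in set(question.lower().split()):
            joint[(word, label)] += 1
            word_counts[word] += 1

    weights = {}
    for word, n_w in word_counts.items():
        mi = 0.0
        for label, n_y in label_counts.items():
            n_wy = joint[(word, label)]
            # Two cases per label: word present, word absent.
            for n_joint, n_word in ((n_wy, n_w), (n_y - n_wy, n - n_w)):
                if n_joint > 0:
                    mi += (n_joint / n) * math.log2(n_joint * n / (n_word * n_y))
        weights[word] = mi
    return weights


def weighted_question_embedding(question, weights, vectors, dim):
    """Weighted average of word vectors; words without a vector are skipped.
    Returns a zero vector if the total weight is zero."""
    total = [0.0] * dim
    norm = 0.0
    for word in question.lower().split():
        if word in vectors:
            w = weights.get(word, 0.0)
            norm += w
            for i, value in enumerate(vectors[word]):
                total[i] += w * value
    if norm == 0.0:
        return total
    return [value / norm for value in total]
```

In this sketch a word that occurs equally often under every label (for example a function word present in all questions) receives a weight of zero, while a word that occurs only under one label receives a high weight, so the question embedding is dominated by label-discriminative terms.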

Full Text