Abstract
Medical images are difficult to comprehend for a person without expertise. Medical practitioners are scarce across the globe, and those available often face physical and mental fatigue due to high caseloads, which can induce human error during diagnosis. In such scenarios, an additional opinion can help boost the confidence of the decision maker. Thus, it becomes crucial to have a reliable visual question answering (VQA) system that can provide a ‘second opinion’ on medical cases. However, most existing VQA systems cater to real-world problems and are not specifically tailored for handling medical images. Moreover, a VQA system for medical images must cope with the limited amount of training data available in this domain. In this paper, we develop MedFuseNet, an attention-based multimodal deep learning model for VQA on medical images that takes these challenges into account. MedFuseNet aims to maximize learning with minimal complexity by breaking the problem into simpler tasks and then predicting the answer. We tackle two types of answer prediction: categorization and generation. We conducted an extensive set of quantitative and qualitative analyses to evaluate the performance of MedFuseNet. Our experiments demonstrate that MedFuseNet outperforms state-of-the-art VQA methods, and visualization of the captured attentions showcases the interpretability of our model’s predicted results.
Highlights
Medical images are difficult to comprehend for a person without expertise
We quantitatively evaluate the performance of MedFuseNet and compare it with the baseline models described in the “visual question answering (VQA) baseline models for comparison” section on the tasks of answer categorization and answer generation
The Bilinear Attention Networks (BAN) model is more competitive with MedFuseNet for category 3, but it under-performs our model by 2 percent for category 1 and by 1.4 percent for category 2
Summary
We first provide an overview of related works on VQA tasks in the real-world and medical domains, and discuss related works on the components of VQA approaches. A typical VQA model first extracts feature vectors from the two modalities (image and question text), combines the vectors using one of the above-stated fusion techniques, and predicts the answer from the fused vector. As shown in Algorithm 1 (lines 1–12), MedFuseNet first extracts the feature vectors v and q for the input image and question, respectively. This is followed by the computation of the attended question features q_e using the question attention mechanism E_q(q). MedFuseNet then uses this attended vector, instead of the raw question feature vector q, as input to the image attention mechanism described in Algorithm 1 (lines 8–18).
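The following is a minimal PyTorch sketch of the pipeline summarized above: question features are first self-attended, the attended question vector then guides attention over image regions, and the two attended vectors are fused to predict an answer from a fixed answer set. The module layout, dimensions, and names (VQASketch, q_attn, img_attn, fuse) are illustrative assumptions, not the exact MedFuseNet architecture.

```python
# Hypothetical sketch of a question-guided attention VQA pipeline (not the
# paper's exact MedFuseNet implementation).
import torch
import torch.nn as nn


class VQASketch(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden=512, num_answers=500):
        super().__init__()
        self.q_attn = nn.Linear(q_dim, 1)              # question self-attention scores
        self.img_attn = nn.Linear(img_dim + q_dim, 1)  # question-guided image attention scores
        self.fuse = nn.Linear(img_dim + q_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)  # answer categorization head

    def forward(self, v, q):
        # v: (batch, regions, img_dim) image region features
        # q: (batch, tokens, q_dim) question token features
        # 1) attended question features q_e
        a_q = torch.softmax(self.q_attn(q), dim=1)            # weights over tokens
        q_e = (a_q * q).sum(dim=1)                            # (batch, q_dim)
        # 2) image attention guided by q_e instead of the raw question features
        q_exp = q_e.unsqueeze(1).expand(-1, v.size(1), -1)
        a_v = torch.softmax(self.img_attn(torch.cat([v, q_exp], dim=-1)), dim=1)
        v_att = (a_v * v).sum(dim=1)                          # (batch, img_dim)
        # 3) fuse the attended image and question vectors and predict the answer
        h = torch.relu(self.fuse(torch.cat([v_att, q_e], dim=-1)))
        return self.classifier(h)                             # logits over candidate answers
```

The attention weights a_q and a_v computed in this sketch are what the qualitative visualizations in the paper would correspond to: they indicate which question tokens and image regions the model relied on when producing an answer.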