Abstract: This paper introduces an approach to medical visual question answering (VQA) built on the Cross-ViT architecture. The model employs a dual-branch design to extract multi-scale feature representations from images, using cross-attention between the branches to enhance the visual features. A Stacked Attention Network (SAN) fuses these visual features with semantic features extracted from the question text by an LSTM encoder. Experiments on several biomedical VQA tasks demonstrate notable improvements in retrieval accuracy and image-text correlation. The study highlights the potential of medical VQA systems to transform healthcare delivery, improve diagnostic accuracy, and facilitate patient engagement and education, with promising future applications in telemedicine, surgery assistance, and integration with electronic health records.
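To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the three components named in the abstract: a dual-branch multi-scale image encoder with cross-attention, an LSTM question encoder, and SAN-style fusion. This is an illustration under assumed shapes and hyperparameters, not the paper's actual implementation; all module names (e.g. `MedicalVQASketch`, `CrossAttention`, `StackedAttention`) and the conv-based stand-ins for the two ViT patch branches are hypothetical.

```python
# Minimal sketch of the abstract's pipeline (assumed shapes and module
# names; not the authors' implementation).
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Cross-attention: tokens from one branch attend to the other branch."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q_tokens, kv_tokens):
        out, _ = self.attn(q_tokens, kv_tokens, kv_tokens)
        return q_tokens + out  # residual connection

class StackedAttention(nn.Module):
    """One SAN hop: question-guided attention over image regions."""
    def __init__(self, dim):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)
        self.q_proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, img_feats, q_vec):
        # img_feats: (B, N, D) region tokens; q_vec: (B, D) question vector
        h = torch.tanh(self.img_proj(img_feats) + self.q_proj(q_vec).unsqueeze(1))
        attn = torch.softmax(self.score(h), dim=1)        # (B, N, 1) over regions
        attended = (attn * img_feats).sum(dim=1)          # (B, D) attended summary
        return q_vec + attended  # refined query for the next SAN hop

class MedicalVQASketch(nn.Module):
    def __init__(self, vocab_size, dim=256, num_answers=100):
        super().__init__()
        # Conv patch embeddings stand in for the small/large-patch ViT branches.
        self.small_branch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.large_branch = nn.Conv2d(3, dim, kernel_size=32, stride=32)
        self.cross_s2l = CrossAttention(dim)
        self.cross_l2s = CrossAttention(dim)
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.san1 = StackedAttention(dim)
        self.san2 = StackedAttention(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_ids):
        # Multi-scale patch tokens from the two branches.
        s = self.small_branch(image).flatten(2).transpose(1, 2)  # (B, Ns, D)
        l = self.large_branch(image).flatten(2).transpose(1, 2)  # (B, Nl, D)
        # Cross-attention exchanges information between the two scales.
        s = self.cross_s2l(s, l)
        l = self.cross_l2s(l, s)
        img_feats = torch.cat([s, l], dim=1)                     # (B, Ns+Nl, D)
        # LSTM question encoder: final hidden state as the semantic summary.
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                                # (B, D)
        # Two stacked attention hops refine the query over image regions.
        q = self.san1(img_feats, q)
        q = self.san2(img_feats, q)
        return self.classifier(q)  # logits over the answer vocabulary

model = MedicalVQASketch(vocab_size=1000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 100])
```

Treating VQA as classification over a fixed answer vocabulary, as in the final layer here, is one common design choice for medical VQA; the source does not specify the answer head, so this detail is assumed.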