Visual Question Research Articles

Existing Visual Question Answering (VQA) models suffer from language priors: their answers rely too heavily on correlations between questions and answers and ignore the actual visual information, which causes a significant performance drop on out-of-distribution datasets. To reduce this language bias, prevalent approaches weaken the language prior with an auxiliary question-only branch, but they target the statistical distribution of question type–answer pairs rather than that of question–answer pairs. In addition, most models produce answers with improper visual grounding. This paper proposes a model-agnostic framework that addresses these drawbacks through question-conditioned debiasing with focal visual context fusion. First, instead of question type-conditioned correlations, we tackle the language distribution shortcut at the level of question-conditioned correlations by removing the shortcut between each question and its most frequently occurring answer. Additionally, we use the deviation between the predicted answer distribution and the ground truth as a pseudo target, which keeps the model from falling into the distribution bias of other frequent answers. Further, we stress that the imbalance between the number of images and the number of questions poses higher requirements for a proper visual context. We improve the model's ability to use the correct visual information through contrastive sampling, and we design a focal visual context fusion module that incorporates the critical object word, extracted from the question by Part-Of-Speech tagging, into the visual features to augment the salient visual information without human annotations. Extensive experiments on three public benchmark datasets, i.e., VQA v2, VQA-CP v2, and VQA-CP v1, demonstrate the effectiveness of our model.
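The difference between a question type-conditioned prior and a question-conditioned prior can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, and using the first two words of a question as its "question type" is an assumption made only for this example.

```python
from collections import Counter, defaultdict


def question_type(question):
    # Assumption for illustration: the first two words stand in for the question type.
    return " ".join(question.lower().split()[:2])


def most_frequent_answers(pairs, key_fn):
    """Return the most frequent answer for each key derived from the question."""
    counts = defaultdict(Counter)
    for question, answer in pairs:
        counts[key_fn(question)][answer] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


# Toy training pairs (made up for the example).
pairs = [
    ("What color is the banana?", "yellow"),
    ("What color is the banana?", "yellow"),
    ("What color is the banana?", "green"),
    ("What color is the sky?", "blue"),
    ("What color is the sky?", "blue"),
    ("What color is the sky?", "blue"),
]

# Prior conditioned on the question type: one shortcut answer for all "what color" questions.
by_type = most_frequent_answers(pairs, question_type)

# Prior conditioned on the full question: one shortcut answer per question.
by_question = most_frequent_answers(pairs, lambda q: q.lower())
```

In this toy data the type-level prior assigns a single frequent answer to every "what color" question, while the question-level prior keeps a separate frequent answer for each question, which is the shortcut the abstract says it removes.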

Counting-based questions play a major part in Visual Question Answering (VQA), and the most challenging factor is counting the different objects present in the images. Recently, more attention has been paid to designing count-aided VQA models. Based on the questions, the VQA system responds with appropriate answers. Yet complex questions still require answers from the system. Earlier models still struggle to count the various objects within the images because they fail to select suitable features and lack fine-grained representation. To preserve the image representation, this paper proposes a new VQA model that uses a heuristic approach with serial cascaded deep learning methods. Initially, standard image and text data are gathered and pre-processed. Subsequently, features are extracted from both the image and the text data: deep image features are obtained with Visual Geometry Group 16 (VGG16), and text features are extracted with a Text Convolutional Neural Network (TCNN). Then, optimal weighted fused features are obtained, with the fusion weights tuned by the Improved Tuna Swarm Optimization (ITSO) algorithm. Finally, the counting answers for the given queries are retrieved by a Serial Cascaded Recurrent Neural Network with Attention Mechanism-based Long Short-Term Memory (SCRAM-LSTM). The performance is evaluated with diverse metrics and compared with that of conventional models. The findings indicate that the model offers superior performance in estimating the appropriate answers. The proposed work could therefore be used in applications such as helping blind or visually impaired people obtain information, integration with image retrieval systems, and search engines, and in vision-and-language systems in particular.
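The weighted fusion step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: it assumes the image and text features have already been projected to the same length, uses a single scalar weight, and replaces ITSO with a plain grid search. The function names and the objective are hypothetical.

```python
def fuse(image_feat, text_feat, w):
    """Weighted sum of two equal-length feature vectors, with weight w on the image side."""
    if len(image_feat) != len(text_feat):
        raise ValueError("features must have the same length")
    return [w * i + (1.0 - w) * t for i, t in zip(image_feat, text_feat)]


def tune_weight(image_feat, text_feat, objective, steps=11):
    """Grid search over w in [0, 1]; a simple stand-in for ITSO, not ITSO itself."""
    candidates = [k / (steps - 1) for k in range(steps)]
    return min(candidates, key=lambda w: objective(fuse(image_feat, text_feat, w)))


def squared_distance_to(target):
    """Hypothetical objective: squared distance of the fused vector to a target vector."""
    return lambda fused: sum((f - t) ** 2 for f, t in zip(fused, target))


# Toy features (made up for the example).
image_feat = [1.0, 0.0]
text_feat = [0.0, 1.0]
best_w = tune_weight(image_feat, text_feat, squared_distance_to([0.5, 0.5]))
```

In practice the objective would be a validation loss of the downstream answer model rather than a distance to a fixed vector, and a swarm-based optimizer would search a higher-dimensional weight space.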

Articles published on Visual Question

Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

Design of knowledge incorporated VQA based on spatial GCNN with structured sentence embedding and linking algorithm

Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering

Multimodal Bi-direction Guided Attention Networks for Visual Question Answering

Graph convolutional network for difficulty-controllable visual question generation

Symmetric Graph-Based Visual Question Answering Using Neuro-Symbolic Approach

Context-aware Multi-level Question Embedding Fusion for visual question answering

A Comprehensive Review and Open Challenges on Visual Question Answering Models

A Simple Framework for Scene Graph Reasoning with Semantic Understanding of Complex Sentence Structure

Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations.

An Effective Med-VQA Method Using a Transformer with Weights Fusion of Multiple Fine-Tuned Models

Transformer-Based Relational Inference Network for Complex Visual Relational Reasoning

Visual question generation for explicit questioning purposes based on target objects

Improving visual question answering for bridge inspection by pre‐training with external data of image–text pairs

SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering

Question-conditioned debiasing with focal visual context fusion for visual question answering

Dual-feature collaborative relation-attention networks for visual question answering

Investigation of Available Datasets and Techniques for Visual Question Answering

Counting-based visual question answering with serial cascaded attention deep learning

General Greedy De-bias Learning.
