Textbook question answering (TQA) requires answering both diagram and non-diagram questions accurately, given a long multimodal context of abundant diagrams and essays. Although many studies have made significant progress on natural-image question answering (QA), their methods do not transfer to comprehending diagrams or reasoning over such long multimodal contexts. To address these issues, we propose a relation-aware fine-grained reasoning (RAFR) network that performs fine-grained reasoning over the nodes of relation-based diagram graphs. Our method constructs relation graphs from the semantic dependencies and relative positions between nodes in a diagram and applies graph attention networks to learn diagram representations. To extract and reason over multimodal knowledge, we first retrieve the text most relevant to the question and its options at the word-sentence level, and the instructional diagram most relevant to the question diagram at the node-diagram level. We then apply instructional-diagram-guided attention and question-guided attention to reason over the nodes of the question diagram. Experimental results show that the proposed method achieves the best performance on the TQA dataset compared with baselines, and extensive ablation studies comprehensively analyze its components.
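The abstract mentions applying graph attention networks to relation-based diagram graphs. As a rough illustration (not the paper's implementation), the sketch below shows a minimal single-head graph-attention layer in NumPy: diagram nodes carry feature vectors, a relation-derived adjacency matrix masks the attention, and each node aggregates its related neighbors. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def gat_layer(H, A, W, a, alpha=0.2):
    """Minimal single-head graph-attention layer (illustrative sketch only).
    H: (N, F) node features; A: (N, N) relation adjacency (1 = related,
    include self-loops); W: (F, F') projection; a: (2*F',) attention vector."""
    Z = H @ W                                    # project node features
    N = Z.shape[0]
    # attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            s = a @ np.concatenate([Z[i], Z[j]])
            e[i, j] = s if s > 0 else alpha * s
    e = np.where(A > 0, e, -1e9)                 # mask unrelated node pairs
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # softmax over neighbors
    return att @ Z                               # weighted neighbor aggregation

# toy usage: 3 diagram nodes with 5-d features, fully related
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 5))
A = np.ones((3, 3))                              # adjacency with self-loops
W = rng.normal(size=(5, 4))
a = rng.normal(size=(8,))
H_out = gat_layer(H, A, W, a)                    # (3, 4) updated node features
```

In the paper's setting, the adjacency would come from the semantic dependencies and relative positions between diagram nodes rather than a dense matrix.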