Multimodal grid features and cell pointers for scene text visual question answering

Lluís Gómez, Rubén Tito, Ali Furkan Biten, Ernest Valveny, Marçal Rusiñol, Andrés Mafla, Dimosthenis Karatzas

doi:10.1016/j.patrec.2021.06.026

Abstract

This paper presents a new model for the task of scene text visual question answering. In this task, questions about a given image can only be answered by reading and understanding scene text. Current state-of-the-art models for this task use a dual attention mechanism in which one attention module attends to visual features while the other attends to textual features. A possible issue with this design is that it makes it difficult for the model to reason jointly about both modalities. To address this problem, we propose a new model based on a single attention mechanism that attends to multi-modal features conditioned on the question. The output weights of this attention module over a grid of multi-modal spatial features are interpreted as the probability that a certain spatial location of the image contains the answer text to the given question. Our experiments demonstrate competitive performance on two standard datasets with a model that is 5× faster than previous methods at inference time. Furthermore, we also provide a novel analysis of the ST-VQA dataset based on a human performance study. Supplementary material, code, and data are made available through this link.
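The single-attention idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how features are fused or scored, so the concatenation of per-cell visual and textual features, the dot-product scoring, the toy feature values, and all function names below (`concat_grid`, `attention_over_grid`, `point_to_cell`) are assumptions made purely for illustration. The sketch shows only the general pattern of one question-conditioned softmax over a grid whose weights are read as "this cell contains the answer" probabilities.

```python
import math


def concat_grid(visual, textual):
    """Fuse per-cell visual and textual feature vectors by concatenation.

    Assumption: simple concatenation stands in for whatever fusion the
    actual model uses.
    """
    return [[v + t for v, t in zip(v_row, t_row)]
            for v_row, t_row in zip(visual, textual)]


def attention_over_grid(grid, question):
    """Return a probability map over grid cells, conditioned on the question.

    Assumption: each cell is scored by a dot product with the question
    vector, followed by one softmax over all cells.
    """
    scores = [[sum(f * q for f, q in zip(cell, question)) for cell in row]
              for row in grid]
    top = max(max(row) for row in scores)  # subtract max for numerical stability
    exps = [[math.exp(s - top) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def point_to_cell(probs):
    """Return the (row, col) index of the most probable cell."""
    cells = [(i, j) for i, row in enumerate(probs) for j in range(len(row))]
    return max(cells, key=lambda ij: probs[ij[0]][ij[1]])


# Toy 2x2 grid: 2-d visual features and 2-d text embeddings per cell.
visual = [[[0.1, 0.0], [0.2, 0.1]],
          [[0.0, 0.3], [0.1, 0.1]]]
textual = [[[0.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [0.0, 0.0]]]
# Hypothetical OCR tokens located in each cell (None means no text there).
tokens = [[None, "STOP"],
          ["EXIT", None]]
# Hypothetical question embedding with the same width as a fused cell vector.
question = [0.5, 0.5, 2.0, 0.0]

grid = concat_grid(visual, textual)
probs = attention_over_grid(grid, question)
row, col = point_to_cell(probs)
answer = tokens[row][col]
```

Because a single softmax covers every cell, visual and textual evidence compete in the same distribution, which is the contrast with the dual-attention design that the abstract draws. In this toy setup the answer is read off by looking up the OCR token in the highest-probability cell; how the actual model maps cells to answer strings is not stated in the abstract.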

Journal: Pattern Recognition Letters
Publication Date: Oct 1, 2021
Citations: 14

Similar Papers

Case instance segmentation of small farmland based on Mask R-CNN of feature pyramid network with double attention mechanism in high resolution satellite images
Yangyang Cao ... Houcheng Yang
Computers and Electronics in Agriculture | VOL. 212
26 Jul 2023

Multi-modality hierarchical attention networks for defect identification in pipeline MFL detection
Gang Wang ... Xusheng Sun
Measurement Science and Technology | VOL. 35
05 Aug 2024

Recurrent convolutional video captioning with global and local attention
Tao Jin ... Zhongfei Zhang
Neurocomputing | VOL. 370
27 Aug 2019

One Spatio-Temporal Sharpening Attention Mechanism for Light-Weight YOLO Models Based on Sharpening Spatial Attention.
Mengfan Xue ... Minghao Chen
Sensors | VOL. 21
28 Nov 2021
