Abstract

Text contained in an image provides useful information about that image. Consider a warning signboard reading "high voltage": the text indicates the hazard present in the scene. This semantic textual information can therefore be very useful for a better understanding of images, yet it is not exploited by existing visual question answering (VQA) models, even though its presence can strongly guide the VQA task. This work deals with visual question answering by exploiting these textual cues together with the visual content to boost the accuracy of VQA models. We propose a novel VQA model based on a PHOC (Pyramidal Histogram of Characters) and Fisher Vector based representation: from the PHOCs of the scene text, we construct a powerful descriptor using Fisher Vectors. The proposed model also combines a transformer with a dynamic pointer network for answer decoding, so answers are generated through a sequence of decoding steps rather than treating answer prediction as a classification problem, as done in previous works. We report qualitative and quantitative results on three popular datasets: VQA 2.0, TextVQA and ST-VQA. The results show the effectiveness of the proposed model over existing models.
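For concreteness, the following is a minimal sketch of how the PHOC vectors of an image's OCR tokens could be aggregated into a single Fisher Vector descriptor. It assumes scikit-learn's diagonal-covariance GaussianMixture and a 604-dimensional PHOC (a common configuration); the gradient statistics and power/L2 normalization follow the standard improved Fisher Vector recipe, and the code is an illustration under those assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(phocs, gmm):
    """Aggregate a set of PHOC vectors (N x D) into one Fisher Vector:
    gradients of the GMM log-likelihood w.r.t. component means and
    variances, followed by power- and L2-normalization."""
    X = np.atleast_2d(phocs)
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                       # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)                               # (K, D) diagonal std-devs

    # First- and second-order statistics per GMM component
    diff = (X[:, None, :] - mu[None, :, :]) / sigma    # (N, K, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (gamma[..., None] * (diff**2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization

# Hypothetical usage: 604-dim PHOCs for the OCR tokens of one image.
# Random arrays stand in for real PHOC embeddings of scene-text words.
gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(np.random.rand(1000, 604))                     # stand-in training PHOCs
image_descriptor = fisher_vector(np.random.rand(7, 604), gmm)
print(image_descriptor.shape)                          # (2 * 16 * 604,)
```

The resulting fixed-length descriptor summarizes a variable number of scene-text tokens per image, which is what makes it convenient to fuse with the visual features inside a VQA model.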
