Abstract

Text contained in an image provides useful information about that image. Consider a warning signboard reading "high voltage": the text indicates the hazard present in the scene. This semantic textual information can therefore be very useful for a better understanding of images, yet it is not exploited by existing visual question answering (VQA) models, even though its presence can strongly guide the VQA task. This work deals with visual question answering by exploiting these textual cues together with the visual content to boost the accuracy of VQA models. We propose a novel VQA model based on a PHOC (Pyramidal Histogram of Characters) and Fisher Vector based representation: from the PHOCs of the scene text, we construct a powerful descriptor using Fisher Vectors. The proposed model also combines a transformer with a dynamic pointer network for answer decoding, so answers are generated through a sequence of decoding steps rather than treating answer prediction as a classification problem, as done in previous works. We report qualitative and quantitative results on three popular datasets: VQA 2.0, TextVQA and ST-VQA. The results show the effectiveness of the proposed model over existing models.
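For concreteness, the following is a minimal sketch of how the PHOC vectors of an image's OCR tokens could be aggregated into a single Fisher Vector descriptor. It assumes scikit-learn's diagonal-covariance GaussianMixture and a 604-dimensional PHOC (a common configuration); the gradient statistics and power/L2 normalization follow the standard improved Fisher Vector recipe, and the code is an illustration under those assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(phocs, gmm):
    """Aggregate a set of PHOC vectors (N x D) into one Fisher Vector:
    gradients of the GMM log-likelihood w.r.t. component means and
    variances, followed by power- and L2-normalization."""
    X = np.atleast_2d(phocs)
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                       # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)                               # (K, D) diagonal std-devs

    # First- and second-order statistics per GMM component
    diff = (X[:, None, :] - mu[None, :, :]) / sigma    # (N, K, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (gamma[..., None] * (diff**2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization

# Hypothetical usage: 604-dim PHOCs for the OCR tokens of one image.
# Random arrays stand in for real PHOC embeddings of scene-text words.
gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(np.random.rand(1000, 604))                     # stand-in training PHOCs
image_descriptor = fisher_vector(np.random.rand(7, 604), gmm)
print(image_descriptor.shape)                          # (2 * 16 * 604,)
```

The resulting fixed-length descriptor summarizes a variable number of scene-text tokens per image, which is what makes it convenient to fuse with the visual features inside a VQA model.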
