Abstract

In this paper, we present a Grad-CAM aware supervised attention framework for visual question answering (VQA) in post-disaster damage assessment. Visual attention in VQA aims to focus on the image regions relevant to a question in order to predict the answer. However, conventional attention mechanisms in VQA operate in an unsupervised manner, learning to weight visual content by minimizing only the task-specific loss. This approach fails to provide appropriate visual attention when the visual content is highly complex. The UAV images in the FloodNet-VQA dataset are particularly complex, as they depict the hazardous scenario after Hurricane Harvey from a high altitude. To tackle this, we propose a supervised attention mechanism that uses explainable features from Grad-CAM to supervise visual attention in the VQA pipeline. The proposed mechanism operates in two stages. In the first stage, we derive visual explanations through Grad-CAM by training a baseline attention-based VQA model. In the second stage, we supervise the visual attention for each question by incorporating the Grad-CAM explanations obtained in the first stage. Our model improves over state-of-the-art VQA models by a considerable margin on the FloodNet dataset.
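The abstract does not specify the training objective, so the following is a minimal sketch, assuming a PyTorch pipeline, of how a Grad-CAM map produced by the stage-1 baseline could supervise the stage-2 attention through an auxiliary loss. The function names, the KL-divergence formulation, and the weight `lambda_att` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of Grad-CAM supervised attention.
import torch
import torch.nn.functional as F

def supervised_attention_loss(att_logits, gradcam_map, eps=1e-8):
    """KL divergence between the model's spatial attention and a Grad-CAM prior.

    att_logits:  (B, H*W) unnormalized attention scores over image regions
    gradcam_map: (B, H*W) non-negative Grad-CAM saliency from the stage-1 model
    """
    att_log_probs = F.log_softmax(att_logits, dim=-1)
    # Normalize the Grad-CAM map into a probability distribution over regions.
    target = gradcam_map / (gradcam_map.sum(dim=-1, keepdim=True) + eps)
    return F.kl_div(att_log_probs, target, reduction="batchmean")

def total_loss(answer_logits, answer_labels, att_logits, gradcam_map, lambda_att=0.5):
    # Stage-2 objective: answer-classification loss plus attention supervision.
    task_loss = F.cross_entropy(answer_logits, answer_labels)
    att_loss = supervised_attention_loss(att_logits, gradcam_map)
    return task_loss + lambda_att * att_loss
```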
