Abstract

Visual question answering (VQA) for remote sensing (RS) images is a representative multi-modal task that draws on advances in natural language processing and computer vision. However, it remains challenging for two reasons. First, an RS image contains a wealth of visual elements, only a few of which are relevant to a given question; as a result, the regions a human would look at to answer a question differ from those highlighted by current attention approaches. Second, the class imbalance in existing RSVQA datasets biases predictions towards frequent answers. To address these issues, we explore the intrinsic relationship between RS visual elements and the generated text while compensating for sample imbalance. We propose a new method, the Union Context-wise and Alternate-Guided Attention Network (UCAGAN). Our method uses a cross-modal alternate-guided attention module to align visual and textual features. Moreover, we introduce an improved multi-category loss function to compensate for the model bias caused by sample imbalance. Extensive experiments on a diverse range of datasets demonstrate that our approach is both effective and efficient, achieving state-of-the-art performance. Our work provides results that are not only correctable but also explainable, ultimately supporting the development of reliable VQA models for RS images.
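The abstract does not specify the architecture in detail, so the following is only a minimal illustrative sketch of the two ideas it names: attention over one modality guided by the other, applied alternately in both directions, and a frequency-weighted multi-category loss for imbalance compensation. All module names, feature shapes, and the inverse-frequency weighting scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): alternate cross-modal guided
# attention plus an inverse-frequency-weighted cross-entropy loss, assuming
# region-level image features and word-level question features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Attend over `context` features using a `guide` vector as the query."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, guide, context):
        # guide: (B, D), context: (B, N, D)
        q = self.query(guide).unsqueeze(1)             # (B, 1, D)
        k = self.key(context)                          # (B, N, D)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5   # (B, N)
        attn = F.softmax(scores, dim=-1)               # attention weights over N items
        return (attn.unsqueeze(-1) * context).sum(1)   # (B, D) attended summary

class AlternateGuidedFusion(nn.Module):
    """Let the question guide attention over image regions, then let the
    attended visual feature guide attention back over the question words."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.text_to_image = GuidedAttention(dim)
        self.image_to_text = GuidedAttention(dim)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, img_regions, word_feats, q_summary):
        # img_regions: (B, R, D), word_feats: (B, T, D), q_summary: (B, D)
        v = self.text_to_image(q_summary, img_regions)  # question-guided visual feature
        t = self.image_to_text(v, word_feats)           # image-guided textual feature
        return self.classifier(torch.cat([v, t], dim=-1))

def balanced_ce_loss(logits, targets, class_counts):
    # Down-weight frequent answer classes with inverse-frequency weights; this
    # is one common imbalance compensation, the paper's exact loss may differ.
    weights = 1.0 / class_counts.clamp(min=1).float()
    weights = weights / weights.sum() * len(class_counts)
    return F.cross_entropy(logits, targets, weight=weights)

# Toy usage with random features and answer-frequency counts.
B, R, T, D, A = 4, 36, 14, 512, 100
model = AlternateGuidedFusion(D, A)
logits = model(torch.randn(B, R, D), torch.randn(B, T, D), torch.randn(B, D))
loss = balanced_ce_loss(logits, torch.randint(0, A, (B,)),
                        torch.randint(1, 500, (A,)))
loss.backward()
```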
