Abstract
Due to the complex backgrounds and diverse attributes of salient objects in optical remote sensing images (RSIs), existing methods rely heavily on pixel-level annotations, while sparse annotations have received limited attention. Moreover, the limited information carried by sparse annotations leaves a significant performance gap between weakly supervised salient object detection models and their fully supervised counterparts. To address these limitations, this paper proposes a novel Multi-source Information Fusion Attention Network (MIFA-Net). MIFA-Net employs an encoder–decoder architecture comprising a Boundary Detection Block (BDB), a Region Activation Block (RAB), and a Multi-source Information Fusion Block (MIFB). The encoder leverages a pre-trained VGG-16 model to extract basic features from input images. The BDB, supervised by pseudo-boundary masks, detects the boundaries of salient objects from these features, while the RAB, supervised by image-level class labels, activates the region information of salient objects. Acting as the decoder, the MIFB gradually integrates these sources of information and, under the supervision of scribble annotations, restores high-quality optical remote sensing saliency maps. Additionally, we introduce a deep supervision strategy and define a comprehensive loss function to constrain the training process. Experimental results on two benchmark datasets demonstrate that MIFA-Net significantly outperforms existing weakly supervised models, achieving an Sα of 0.905 and reducing the MAE (M) to 0.008 on EORSSD, performance that is comparable to, or even surpasses, that of fully supervised models.
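To make the data flow concrete, the sketch below gives one plausible PyTorch layout of the architecture the abstract describes: a VGG-16 encoder, a BDB head supervised by pseudo-boundary masks, a CAM-style RAB head supervised by image-level labels, and MIFB decoder stages that fuse encoder skip features with the upsampled decoder state. Only the block names and overall structure come from the abstract; every internal choice (channel widths, where BDB and RAB tap the backbone, the concatenation-based fusion rule, the class name MIFANetSketch) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16


class MIFANetSketch(nn.Module):
    """Illustrative skeleton only: block names follow the paper, but all
    internal layer choices are assumptions, not the authors' design."""

    def __init__(self, num_classes=2):
        super().__init__()
        # Pre-trained VGG-16 backbone (torchvision >= 0.13 weights API).
        feats = vgg16(weights="DEFAULT").features
        # Split the backbone at its five conv blocks (64/128/256/512/512 ch).
        self.stages = nn.ModuleList(
            [feats[:4], feats[4:9], feats[9:16], feats[16:23], feats[23:30]])
        chs = [64, 128, 256, 512, 512]
        # BDB: boundary logits from a shallow stage (supervised by
        # pseudo-boundary masks); the tap point is a guess.
        self.bdb = nn.Sequential(nn.Conv2d(chs[1], 64, 3, padding=1),
                                 nn.ReLU(inplace=True), nn.Conv2d(64, 1, 1))
        # RAB: CAM-style head on the deepest stage (supervised by
        # image-level class labels).
        self.rab = nn.Conv2d(chs[4], num_classes, 1)
        # MIFB decoder: one fusion conv per resolution level.
        self.mifb = nn.ModuleList(
            [nn.Conv2d(chs[i] + chs[i + 1], chs[i], 3, padding=1)
             for i in range(4)])
        self.head = nn.Conv2d(chs[0], 1, 1)  # saliency logits (scribble-supervised)

    def forward(self, x):
        skips = []
        for stage in self.stages:        # encoder: one feature per conv block
            x = stage(x)
            skips.append(x)
        boundary = self.bdb(skips[1])    # BDB boundary map
        cam = self.rab(skips[-1])        # RAB region activation map
        cls_logits = F.adaptive_avg_pool2d(cam, 1).flatten(1)
        d = skips[-1]
        for i in range(3, -1, -1):       # MIFB: fuse coarse-to-fine
            d = F.interpolate(d, size=skips[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            d = F.relu(self.mifb[i](torch.cat([skips[i], d], dim=1)))
        saliency = self.head(d)          # final saliency logits
        return saliency, boundary, cls_logits
```

A training loop in the spirit of the paper's deep supervision strategy would attach one loss per output, for example a partial cross-entropy over scribbled pixels for the saliency map, a boundary loss against the pseudo-boundary masks, and a classification loss on cls_logits, summed into one combined objective; the exact loss terms and weights here are likewise assumptions.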