Existing remote sensing semantic segmentation methods generally ignore the structural information of objects, which is vital to human visual recognition. The absence of overall structural information often results in weak perception of subtle textures and fragmented predictions, especially in complex and variable ground object scenarios. In addition, these methods still suffer from the semantic ambiguity caused by unclear object boundary features in remote sensing images. In this paper, we propose a novel remote sensing semantic segmentation framework, called CSBNet, which aims to enhance the capacity for class-guided structural interaction and boundary perception simultaneously. It consists of a class-guided structure interaction module (CSIM), a Transformer-based context aggregation module (TCAM), and a class-guided boundary supervision module (CBSM). The CSIM progressively extracts class-specific structural features, refining the structural information of each class by iteratively exchanging information between initial coarse class tokens and context features. Meanwhile, the TCAM provides the CSIM with more discriminative multi-scale contexts without losing spatial features. In particular, the CBSM plays an auxiliary role, applying the boundary information obtained from the class tokens to supervise the segmentation of boundary regions. On the ISPRS, LoveDA, and UAVid datasets, our method significantly outperforms state-of-the-art remote sensing semantic segmentation approaches.
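The abstract does not specify how the class-token/context exchange in the CSIM is implemented; the following is a minimal PyTorch sketch of one plausible formulation, assuming learnable per-class tokens that are refined by cross-attention over flattened spatial features. All module names, shapes, and the final token-to-pixel similarity step are illustrative assumptions, not CSBNet's actual code.

```python
import torch
import torch.nn as nn


class ClassTokenInteraction(nn.Module):
    """Hypothetical class-token/context cross-attention block.

    One learnable token per class attends to flattened spatial
    features, accumulating class-specific structure; the refined
    tokens are then compared against every pixel to produce
    class-guided logits. This is an assumption-based sketch,
    not the paper's CSIM implementation.
    """

    def __init__(self, num_classes: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(1, num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W) context features, e.g. from a TCAM-like aggregator
        b, c, h, w = feats.shape
        ctx = feats.flatten(2).transpose(1, 2)        # (B, H*W, C)
        tokens = self.class_tokens.expand(b, -1, -1)  # (B, K, C)
        # Class tokens query the spatial context (class -> structure)
        refined, _ = self.attn(tokens, ctx, ctx)
        refined = self.norm(tokens + refined)
        # Pixel-to-token similarity yields class-guided logits (B, K, H, W)
        logits = torch.einsum("bnc,bkc->bkn", ctx, refined).view(b, -1, h, w)
        return refined, logits
```

In an iterative variant, this block would be stacked so that each round of attention further refines the coarse class tokens against the context, matching the progressive refinement the abstract describes.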