ABSTRACT

Semantic segmentation of multimodal remote sensing images is an effective approach to enhancing segmentation accuracy. Optical and synthetic aperture radar (SAR) images capture ground features from distinct perspectives, offering varied information for ground observation. Effectively fusing the information from these two modalities and performing multimodal segmentation remains a promising yet challenging task because of their complementary nature and significant differences. Existing methods tend to ignore spatial-dimension information and fail to bridge the semantic gap between modalities. The recently proposed Selective Structured State Space Model (Mamba), however, offers new opportunities for multimodal fusion. We therefore propose a segmentation framework that fuses optical and SAR images based on Mamba. The framework introduces a novel fusion module inspired by the Mamba principle: it selects effective features from the different modalities and cross-fuses them within a global receptive field, allowing optical and SAR image features to compensate for each other and reducing the semantic gap. The fused features are then accurately segmented by a decoder that incorporates Atrous Spatial Pyramid Pooling (ASPP). On the WHU-OPT-SAR dataset, the method outperforms other state-of-the-art deep learning approaches, achieving an overall accuracy (OA) of 84.13% and a Kappa coefficient of 76.29%.
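To make the two architectural ideas named in the abstract concrete, below is a minimal PyTorch sketch of (1) a cross-modal fusion block that selects and mutually compensates optical and SAR features before mixing them, and (2) a standard ASPP decoder head applied to the fused features. The module names, channel sizes, sigmoid gating, and class count are illustrative assumptions; the paper's actual Mamba-based selective-scan fusion is not reproduced here.

```python
import torch
import torch.nn as nn


class CrossModalGatedFusion(nn.Module):
    """Assumed design: gate how much each modality admits from the other, then mix."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_opt = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_sar = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.mix = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_opt: torch.Tensor, f_sar: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([f_opt, f_sar], dim=1)
        # Cross-compensation: each modality is enriched by gated features of the other.
        opt_enriched = f_opt + self.gate_opt(joint) * f_sar
        sar_enriched = f_sar + self.gate_sar(joint) * f_opt
        return self.mix(torch.cat([opt_enriched, sar_enriched], dim=1))


class ASPP(nn.Module):
    """Standard Atrous Spatial Pyramid Pooling head (DeepLab-style)."""

    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3 if r > 1 else 1,
                          padding=r if r > 1 else 0, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        )
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))


if __name__ == "__main__":
    num_classes = 7  # example class count, assumed for illustration
    fuse, head, classifier = CrossModalGatedFusion(64), ASPP(64, 64), nn.Conv2d(64, num_classes, 1)
    opt_feat = torch.randn(1, 64, 32, 32)  # features from an optical-image encoder
    sar_feat = torch.randn(1, 64, 32, 32)  # features from a SAR-image encoder
    logits = classifier(head(fuse(opt_feat, sar_feat)))
    print(logits.shape)  # torch.Size([1, 7, 32, 32])
```

In the paper's framework the selection step is performed by a Mamba-style state space model with a global receptive field rather than the local sigmoid gates used in this sketch; the gating here only illustrates where cross-modal selection and compensation sit in the pipeline.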