Recently, semantic segmentation based on RGB and thermal infrared (TIR) images has become a research hotspot because of its stability in weak-light environments. However, most current methods ignore the differences between the two data modalities and do not exploit semantic information during multi-modal fusion. In this paper, we propose a novel Semantic-Guided Fusion Network (SGFNet) for RGB-Thermal semantic segmentation, which makes full use of semantic information in multi-modal fusion. SGFNet consists of an asymmetric encoder, with a TIR branch and an RGB branch, and a decoder. We focus on enhancing the multi-modal feature representation in the encoder with a fuse-then-enhance pattern. Specifically, since TIR images are stable under weak-light conditions, we first propose a Semantic Guidance Head to extract semantic information in the TIR branch. In the RGB branch, we propose a Multi-modal Coordination and Distillation Unit to fuse the multi-modal features, and then a Cross-level and Semantic-guided Enhancement Unit to enhance the fused features with cross-level and semantic information. These two units are arranged at every stage of the RGB branch to generate strongly representative features at different levels. For the decoder, to obtain large receptive fields and fine edges, we improve the Lawin ASPP decoder by introducing edge information extracted from the low-level features, yielding an edge-aware Lawin ASPP decoder. With the encoder and decoder working together, SGFNet identifies objects accurately and segments them finely. Extensive experiments on the MFNet dataset demonstrate the superior performance of the proposed SGFNet compared with state-of-the-art methods. The code and results of our method are available at https://github.com/kw717/SGFNet.
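
To make the fuse-then-enhance pattern concrete, below is a minimal, hypothetical PyTorch sketch of one RGB-branch stage. All module names, channel sizes, and internal operations here are illustrative assumptions standing in for the paper's actual Semantic Guidance Head, MCDU, and CSEU designs; the authors' real implementation is in the linked repository.

```python
# A minimal sketch, assuming simple 1x1/3x3 convolutions as stand-ins for
# the paper's fusion and enhancement units. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidanceHead(nn.Module):
    """Stand-in head: predicts coarse per-class logits from a TIR feature."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, tir_feat):
        return self.proj(tir_feat)  # coarse semantic logits

class FuseAndEnhance(nn.Module):
    """One RGB-branch stage: fuse RGB and TIR features (MCDU stand-in),
    then enhance the result with cross-level and semantic cues
    (CSEU stand-in). Equal channel width across stages is assumed."""
    def __init__(self, ch, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)
        self.enhance = nn.Conv2d(2 * ch + num_classes, ch, kernel_size=3,
                                 padding=1)

    def forward(self, rgb_feat, tir_feat, prev_feat, sem_logits):
        # Step 1: fuse the two modalities at this stage.
        fused = self.fuse(torch.cat([rgb_feat, tir_feat], dim=1))
        # Step 2: resize cross-level and semantic inputs to this resolution,
        # then enhance the fused feature with both.
        prev = F.interpolate(prev_feat, size=fused.shape[2:],
                             mode='bilinear', align_corners=False)
        sem = F.interpolate(sem_logits, size=fused.shape[2:],
                            mode='bilinear', align_corners=False)
        return self.enhance(torch.cat([fused, prev, sem], dim=1))

# Usage with dummy tensors (all sizes illustrative; 9 classes on MFNet,
# i.e. 8 object classes plus unlabelled).
ch, n_cls = 64, 9
head = SemanticGuidanceHead(ch, n_cls)
stage = FuseAndEnhance(ch, n_cls)
rgb = torch.randn(1, ch, 60, 80)
tir = torch.randn(1, ch, 60, 80)
prev = torch.randn(1, ch, 120, 160)      # feature from the shallower stage
sem = head(torch.randn(1, ch, 30, 40))   # coarse semantics from the TIR branch
out = stage(rgb, tir, prev, sem)         # -> torch.Size([1, 64, 60, 80])
```

In the abstract's description, one such stage is repeated at every level of the RGB branch, so each level's fused feature is refined by both the level below it and the TIR-derived semantic map.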