Abstract

In recent research, leveraging auxiliary modalities, such as depth or point cloud information, to improve RGB semantic segmentation has shown significant potential. However, existing methods mainly use convolutional modules to aggregate features from auxiliary modalities and thus fail to fully exploit long-range dependencies. Moreover, fusion strategies are typically limited to a single approach. In this paper, we propose a transformer-based multimodal fusion framework that better utilizes auxiliary modalities to enhance semantic segmentation. Specifically, we employ a dual-stream architecture to extract features from the RGB and auxiliary modalities, respectively, and incorporate both early fusion and deep feature fusion. At each layer, a mixed attention mechanism uses features from the other modality to guide and enhance the current modality's features before passing them to the next stage of feature extraction. After feature extraction, we apply an enhanced cross-attention mechanism for feature interaction between the modalities, followed by channel fusion to obtain the final semantic features. We then supervise the RGB stream, auxiliary stream, and fusion stream separately to facilitate the learning of modality-specific representations. Experiments demonstrate that our framework performs well across diverse modalities, achieving state-of-the-art results on the NYU Depth V2, SUN-RGBD, DELIVER, and MFNet datasets.
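The abstract gives no implementation details, but the described interaction step, cross-attention between the two streams followed by channel fusion, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the module name, the use of torch.nn.MultiheadAttention, the residual-plus-norm structure, and the head count are all assumptions.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: each stream attends to the other
    (cross-attention), then the two streams are concatenated and
    projected back along the channel dimension (channel fusion)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per direction of interaction.
        self.rgb_attends_aux = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aux_attends_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_aux = nn.LayerNorm(dim)
        # Channel fusion: concatenate the two streams, project back to dim.
        self.channel_fuse = nn.Linear(2 * dim, dim)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # rgb, aux: (batch, tokens, dim) token sequences from the two streams.
        rgb_enh, _ = self.rgb_attends_aux(query=rgb, key=aux, value=aux)
        aux_enh, _ = self.aux_attends_rgb(query=aux, key=rgb, value=rgb)
        rgb = self.norm_rgb(rgb + rgb_enh)  # residual connection + norm
        aux = self.norm_aux(aux + aux_enh)
        return self.channel_fuse(torch.cat([rgb, aux], dim=-1))


# Example usage with dummy token sequences from the two streams.
fuse = CrossAttentionFusion(dim=256)
rgb_tokens = torch.randn(2, 196, 256)   # e.g. flattened RGB feature map
aux_tokens = torch.randn(2, 196, 256)   # e.g. flattened depth feature map
fused = fuse(rgb_tokens, aux_tokens)    # (2, 196, 256) fused semantic features
```

The bidirectional attention lets each modality query the other before fusion; the paper's "enhanced" cross-attention and its mixed attention at intermediate layers presumably differ in detail.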
