Abstract

The fusion of RGB and thermal data for dense prediction tasks has been demonstrated to be an effective and robust approach in autonomous driving. Nevertheless, the challenge lies in fusing features from different modalities, as a simple fusion strategy may lead to redundant or conflicting semantic information. In this paper, we delve into the hierarchical connections between multi-modal features and propose a novel fusion paradigm termed DHFNet, which decouples multi-modal features into similar global long-distance features and discrepant local detail features for hierarchical feature fusion. At each fusion stage, a Lightweight Global Self-attention (LGSA) module is designed to decouple the global long-distance features at low computational cost, and a Cross Modal Long-distance Feature Fusion (CMLFF) module is designed to eliminate redundant features by facilitating information interaction between the modalities. To decouple local detail features, a Cross Modal Deformable Convolution (CMDC) module is proposed to dynamically extract effective local features and capture misaligned features between the modalities. Finally, the fused global long-distance and local detail features are recoupled to achieve efficient hierarchical fusion. Results on RGB-T semantic segmentation and object detection tasks demonstrate the effectiveness of the proposed method. The code will be available at: https://github.com/donggaomu/DHFNet.
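To make the decouple-fuse-recouple paradigm described above more concrete, the following is a minimal sketch of one fusion stage. All module names (GlobalBranch, LocalBranch, DecoupleFuseBlock) and layer choices are illustrative assumptions standing in for the paper's LGSA, CMLFF, and CMDC modules, not the authors' implementation.

```python
# Minimal sketch of the decouple-fuse-recouple idea from the abstract.
# GlobalBranch, LocalBranch, and DecoupleFuseBlock are hypothetical stand-ins,
# not the authors' LGSA/CMLFF/CMDC modules.
import torch
import torch.nn as nn


class GlobalBranch(nn.Module):
    """Cheap global long-distance context via pooled self-attention."""
    def __init__(self, dim, pool=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool)  # shrink spatial size to keep attention cheap
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.pool(x).flatten(2).transpose(1, 2)   # (b, pool*pool, c) token sequence
        g, _ = self.attn(g, g, g)                     # long-distance interactions
        g = g.mean(dim=1)[:, :, None, None]           # global descriptor per channel
        return g.expand(b, c, h, w)


class LocalBranch(nn.Module):
    """Local detail features via a depthwise convolution."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):
        return self.dw(x)


class DecoupleFuseBlock(nn.Module):
    """Decouple RGB/thermal features into global + local parts, fuse, then recouple."""
    def __init__(self, dim):
        super().__init__()
        self.global_rgb, self.global_thr = GlobalBranch(dim), GlobalBranch(dim)
        self.local_rgb, self.local_thr = LocalBranch(dim), LocalBranch(dim)
        self.recouple = nn.Conv2d(2 * dim, dim, 1)    # merge fused global and local streams

    def forward(self, rgb, thermal):
        g = self.global_rgb(rgb) + self.global_thr(thermal)  # shared long-distance context
        l = self.local_rgb(rgb) + self.local_thr(thermal)    # complementary local detail
        return self.recouple(torch.cat([g, l], dim=1))


if __name__ == "__main__":
    block = DecoupleFuseBlock(dim=64)
    rgb, thermal = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    print(block(rgb, thermal).shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch the cross-modal fusion is reduced to simple addition; the paper's CMLFF and CMDC modules replace this with attention-based interaction and deformable convolution, respectively.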
