Abstract
Salient object detection methods based on dual-modal images have achieved remarkable success with the aid of image acquisition equipment. However, environmental factors often interfere with the Depth and Thermal maps, rendering them ineffective at providing object information. To address this limitation, we utilize the VDT dataset, which includes Visible, Depth, and Thermal images, and propose a triple-modal interaction encoder and multi-scale fusion decoder network (TMNet) to highlight the salient regions. The triple-modal interaction encoder comprises a separation context-aware feature module, a channel-wise fusion module, and a triple-modal refinement and fusion module, enabling us to fully explore and exploit the complementarity among Visible, Depth, and Thermal information. The multi-scale fusion decoder combines a semantic-aware localizing module and a contour-aware refinement module to extract and fuse location and boundary information, yielding a high-quality saliency map. Extensive experiments on the public VDT-2048 dataset demonstrate that our TMNet outperforms existing state-of-the-art methods on all evaluation metrics.
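The abstract names a channel-wise fusion module inside the triple-modal interaction encoder. The sketch below illustrates, under stated assumptions, one way such channel-wise fusion of Visible, Depth, and Thermal features could work in PyTorch: concatenate the three modality features along the channel dimension, reweight channels with learned attention so that unreliable modality channels are suppressed, and project back to a single modality's width. The class name ChannelWiseFusion, all tensor shapes, and the attention design are illustrative assumptions, not the authors' implementation.

    # A minimal sketch, not the paper's code: hypothetical channel-wise
    # fusion of Visible (V), Depth (D), and Thermal (T) encoder features.
    import torch
    import torch.nn as nn

    class ChannelWiseFusion(nn.Module):
        """Fuse three modality features by reweighting channels of the
        concatenated tensor, then projecting back to one modality's width."""
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),  # global context per channel
                nn.Conv2d(3 * channels, 3 * channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(3 * channels // reduction, 3 * channels, 1),
                nn.Sigmoid(),             # per-channel weights in (0, 1)
            )
            self.proj = nn.Conv2d(3 * channels, channels, 1)

        def forward(self, v_feat, d_feat, t_feat):
            x = torch.cat([v_feat, d_feat, t_feat], dim=1)  # stack V, D, T channels
            x = x * self.attn(x)  # down-weight channels from degraded modalities
            return self.proj(x)

    # Usage: fuse 64-channel features from the three encoder streams.
    fusion = ChannelWiseFusion(channels=64)
    v = torch.randn(1, 64, 56, 56)
    d = torch.randn(1, 64, 56, 56)
    t = torch.randn(1, 64, 56, 56)
    out = fusion(v, d, t)  # -> torch.Size([1, 64, 56, 56])

The channel-attention design is one plausible reading of "channel-wise fusion": because Depth or Thermal maps can be corrupted by environmental factors, letting the network learn per-channel weights over the concatenated features gives it a mechanism to rely more on whichever modalities are informative for a given scene.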