Abstract

Global context and global contrast are crucial cues for Salient Object Detection (SOD) in images. Most advanced SOD methods build on CNN-based architectures and achieve impressive results. However, these methods are intrinsically limited in capturing long-range global information, since a CNN extracts features in local sliding windows. In contrast, transformers use a self-attention mechanism to extract features, which gives them a powerful capability for learning global cues. Nonetheless, a pure transformer-based network incurs a large computational overhead and easily suffers from attention collapse as it goes deeper. To address these issues, we propose a Transformer-based Hierarchical Dynamic Decoder (T-HDDNet) for image salient object detection. Specifically, T-HDDNet employs a transformer to encode each image patch into multi-level, multi-resolution features based on the long-range dependencies among pixels. To obtain an accurate, high-resolution saliency map, we develop a dynamic dual upsampling mechanism that enlarges the spatial size of features in a data-driven manner, together with a dynamic feature fusion unit. Finally, hierarchical dynamic decoders built from these two units produce the final saliency map progressively. Extensive experimental results show that the proposed method achieves the best performance on all benchmarks compared with state-of-the-art methods.
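The contrast between local sliding windows and self-attention can be illustrated with a small sketch. It is not the T-HDDNet encoder: it assumes a 1-D sequence of toy tokens, a single attention head with identity query/key/value projections, and a sliding-window mean standing in for a convolution. The function names `self_attention` and `local_average` are hypothetical and chosen for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections, so queries, keys and values
    are the tokens themselves. Every output mixes ALL input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def local_average(tokens, radius=1):
    """Sliding-window mean: a stand-in for a local receptive field.

    Each output only sees tokens within `radius` positions.
    """
    d = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - radius): i + radius + 1]
        out.append([sum(v[j] for v in window) / len(window) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
# Perturb only the LAST token and observe the effect on the FIRST output.
perturbed = tokens[:-1] + [[5.0, 5.0]]

attention_changed = self_attention(tokens)[0] != self_attention(perturbed)[0]
local_changed = local_average(tokens)[0] != local_average(perturbed)[0]
print("self-attention output 0 changed:", attention_changed)
print("local-window output 0 changed:", local_changed)
```

In this toy setting, changing a distant token alters the first self-attention output but leaves the first local-window output untouched, which is the long-range dependency the abstract refers to. The loop over all query/key pairs also shows why cost grows quadratically with the number of tokens.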
