The rapid development of Vision Transformer backbones has enabled models to capture features with global dependencies, yielding excellent performance on salient object detection tasks. However, these backbones fail to adequately emphasize fine local edge features, so the edges in the final output are coarse and blurry. This paper therefore proposes a set prediction method for salient object detection that allows the model to consider saliency (salient object) and edge features simultaneously, achieving end-to-end edge feature fusion without the multiple complex branch structures and multi-stage training required by other methods. The model integrates random edge neighborhood sampling to improve the recognition of local edge features in images. This addresses both the Transformer's weak perception of local features and a practical training issue: edge pixels typically occupy a far smaller proportion of the image than the background, so edge features are insufficiently learned. The proposed end-to-end model fuses multiple features, including edges and salient objects, extracting them within a unified framework while simultaneously outputting edge and salient object maps. Experimental results on six public datasets show that the proposed method significantly improves performance on salient object detection benchmarks.
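The abstract does not specify how random edge neighborhood sampling is implemented; the following is a minimal PyTorch sketch of one plausible reading, in which pixels are drawn from a dilated band around ground-truth edges so an auxiliary edge loss is not dominated by background pixels. The function name, the max-pooling-based dilation, and all defaults are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def sample_edge_neighborhood(edge_gt, num_samples=512, radius=2):
    """Hypothetical sketch of random edge neighborhood sampling.

    edge_gt: (B, 1, H, W) binary ground-truth edge map.
    Returns a (B, 1, H, W) boolean mask selecting sampled pixels.
    """
    # Dilate the edge map with max-pooling to obtain a band of
    # pixels around each edge (the "edge neighborhood").
    k = 2 * radius + 1
    band = F.max_pool2d(edge_gt.float(), kernel_size=k, stride=1, padding=radius)

    # Randomly keep at most num_samples band pixels per image, so the
    # rare edge pixels receive a fixed share of the loss signal.
    B, _, H, W = band.shape
    mask = torch.zeros_like(band, dtype=torch.bool)
    for b in range(B):
        idx = band[b].flatten().nonzero(as_tuple=False).squeeze(1)
        if idx.numel() > num_samples:
            perm = torch.randperm(idx.numel(), device=idx.device)[:num_samples]
            idx = idx[perm]
        mask[b].view(-1)[idx] = True
    return mask

# Usage (shapes assumed): restrict an auxiliary edge loss to the sample.
# pred_edge, gt_edge: (B, 1, H, W) logits and binary targets.
# mask = sample_edge_neighborhood(gt_edge)
# loss = F.binary_cross_entropy_with_logits(pred_edge[mask], gt_edge[mask])
```

Under this reading, the sampling counteracts the edge/background pixel imbalance the abstract describes: the loss is computed on a bounded, edge-centered subset rather than on the full image.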