Abstract

Recently, deep learning-based methods, especially those built on fully convolutional neural networks, have shown extraordinary performance in salient object detection. Despite this success, detecting clean boundaries for salient objects remains challenging. Most contemporary methods rely on dedicated edge detection modules to avoid noisy boundaries. In this work, we propose extracting finer semantic features from multiple encoding layers and attentively re-utilizing them when generating the final segmentation result. The proposed Revise-Net model is divided into three parts: (a) a prediction module, (b) a residual enhancement module (REM), and (c) reverse attention modules. First, we generate a coarse saliency map through the prediction module, which is then refined in the enhancement module. Finally, multiple reverse attention modules at varying scales are cascaded between the two networks to guide the prediction module, using the intermediate segmentation maps generated at each downsampling level of the REM. Our method efficiently classifies boundary pixels using a combination of binary cross-entropy, similarity-index, and intersection-over-union losses at the pixel, patch, and map levels, thereby effectively segmenting salient objects in an image. Our proposed Revise-Net model outperforms several state-of-the-art frameworks by a significant margin on three publicly available datasets, DUTS-TE, ECSSD, and HKU-IS, on both regional and boundary estimation measures.
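The hybrid objective described above, combining a pixel-level binary cross-entropy term, a patch-level similarity term, and a map-level IoU term, can be sketched roughly as follows. This is a generic illustration, not the paper's exact implementation; in particular, the similarity term here is simplified to a whole-map SSIM rather than a local patch-wise computation:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    # Pixel-level binary cross-entropy between saliency map and ground truth.
    p = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def ssim_loss(pred, target, eps=1e-7):
    # Similarity-index term; simplified here to SSIM over the whole map
    # instead of local patches.
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2) + eps)
    return 1.0 - ssim

def iou_loss(pred, target, eps=1e-7):
    # Map-level IoU loss: 1 - intersection / union over the whole map.
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return 1.0 - (inter + eps) / (union + eps)

def hybrid_loss(pred, target):
    # Sum of the pixel-, patch-, and map-level terms.
    return bce_loss(pred, target) + ssim_loss(pred, target) + iou_loss(pred, target)
```

All three terms vanish when the prediction matches the ground truth exactly, and the IoU term in particular penalizes whole-region misses that per-pixel BCE alone weights lightly.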

Highlights

  • The word “salient” describes a property of an object or a region that attracts the human observer’s attention, for example, the object in a scene that humans would naturally focus on

  • The dataset is split into a training set, DUTS-TR, containing 10,553 images taken from the ImageNet DET training/validation sets, and a test set, DUTS-TE, containing 5,019 images taken from the ImageNet DET test set as well as the SUN dataset

  • The model performs efficiently, producing mean absolute error (MAE) scores of 0.033, 0.029, and 0.036 and mean F-measures of 0.900, 0.937, and 0.947 on the DUTS-TE [22], HKU-IS [23], and Extended Complex Scene Saliency Dataset (ECSSD) [24] datasets, respectively. This shows that our proposed model produces very high-quality saliency maps and that its performance extends across multiple datasets, which demonstrates the robustness of the model as well as its capacity to generalize
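The MAE and F-measure scores quoted above follow standard salient object detection conventions; a minimal sketch, assuming saliency maps and ground truth normalized to [0, 1] and the usual β² = 0.3 weighting (the threshold choice here is illustrative):

```python
import numpy as np

def mae(pred, gt):
    # Mean absolute error between predicted saliency map and ground truth,
    # both assumed to lie in [0, 1].
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    # F-measure with beta^2 = 0.3, the convention in salient object
    # detection, computed on the binarized prediction.
    binary = pred >= threshold
    gt_bin = gt >= 0.5
    tp = np.logical_and(binary, gt_bin).sum()
    precision = tp / (binary.sum() + 1e-7)
    recall = tp / (gt_bin.sum() + 1e-7)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-7)
```

Benchmarks typically report either the mean F-measure over a sweep of thresholds or the maximum over that sweep; the single-threshold version above is the simplest variant.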

Introduction

The word “salient” describes a property of an object or a region that attracts the human observer’s attention, for example, the object in a scene that humans would naturally focus on. Humans quickly allocate attention to important regions in visual scenes. The visual system in humans comprises an optimal attentive mechanism capable of identifying the most relevant information from images. Understanding and modeling this visual attention or visual saliency is a fundamental research problem in psychology, neurobiology, cognitive science, and computer vision. Fixation prediction (FP) originated in the cognitive science and psychology communities [1,2,3] and targets predicting where people look in images.
