Abstract

Visual saliency prediction for RGB-D images is more challenging than for their RGB counterparts, and few investigations have addressed RGB-D saliency prediction. This study presents a hierarchical multimodal adaptive fusion (HMAF) network for end-to-end RGB-D saliency prediction. In the proposed method, hierarchical (multilevel) multimodal features are first extracted from an RGB image and its depth map using a VGG-16-based two-stream network. The most significant hierarchical features of the RGB image and depth map are then selected and fused using three two-input attention modules. Finally, the fused saliency features from different levels (hierarchical fusion saliency features) are combined adaptively using a three-input attention module, yielding high-accuracy RGB-D visual saliency prediction. Comparisons against state-of-the-art techniques on two challenging RGB-D datasets demonstrate that the proposed method consistently outperforms competing approaches by a considerable margin.
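To make the fusion step concrete, the following PyTorch sketch shows one possible form of a two-input attention module that fuses same-level RGB and depth features. This is a minimal, illustrative reading of the abstract, not the authors' implementation: the class name TwoInputAttention, the gating convolutions, and the per-pixel softmax weighting are all assumptions, since the module internals are not specified here.

```python
# Illustrative sketch only; the real HMAF two-input attention module is not
# specified in this summary, so the gating design below is an assumption.
import torch
import torch.nn as nn

class TwoInputAttention(nn.Module):
    """Fuses same-level RGB and depth features with learned spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a 2-channel map (one weight per modality) from the
        # concatenated features, then normalize across the two modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # Per-pixel softmax over the modality axis: at each location the
        # network decides how much to trust RGB versus depth.
        weights = torch.softmax(self.gate(torch.cat([rgb_feat, depth_feat], dim=1)), dim=1)
        return weights[:, 0:1] * rgb_feat + weights[:, 1:2] * depth_feat
```

In this reading, one such module would be instantiated per VGG-16 level (hence the three two-input attention modules), each operating on the matching feature maps from the RGB and depth streams.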

Highlights

  • Saliency prediction, wherein the objective is to automatically predict what most attracts human attention in view-free scenarios, is a long-standing classical research topic spanning visual cognition, computer science [1, 2], and imaging techniques [3,4,5,6]

  • To verify the prediction performance of our proposed hierarchical multimodal adaptive fusion (HMAF) network, all compared saliency prediction methods were applied to the same public RGB-D datasets

  • There are very few public RGB-D datasets for visual saliency prediction research. This study evaluates the performance of saliency-prediction methods on two representative datasets, NUS3D [35] and NCTU [36], as follows: (1) NUS3D contains 600 RGB-D images covering several 2D and RGB-D view scenes; it provides depth images, RGB stimuli, and both 2D and RGB-D fixation maps; (2) NCTU comprises 475 RGB-D images along with depth maps; this dataset includes various scenes, most of which are adapted from existing stereo movies and videos

Summary

Introduction

Saliency prediction, wherein the objective is to automatically predict what most attracts human attention in view-free scenarios, is a long-standing classical research topic spanning visual cognition, computer science [1, 2], and imaging techniques [3,4,5,6]. RGB-D saliency prediction has attracted increased attention in recent years [24,25,26,27,28], and extant studies have concentrated on the design of depth-induced RGB-D saliency prediction methods [29,30,31]. In most such methods, the fusion schemes are inadequate for combining the complementary features obtained from an RGB image and its depth map. To facilitate the appropriate fusion of these features, a hierarchical multimodal adaptive fusion (HMAF) network for RGB-D saliency prediction is proposed. A three-input attention module is used to fuse the hierarchical fusion saliency features adaptively, thereby facilitating accurate RGB-D visual saliency prediction.
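As a rough illustration of the adaptive fusion step, the sketch below combines three hierarchical fusion saliency maps with per-pixel softmax weights. This is an assumption-laden reading of the three-input attention module: the class name ThreeInputAttention, the single-channel map inputs, and the bilinear upsampling to a common resolution are hypothetical details not specified in the text.

```python
# Illustrative sketch of a three-input adaptive fusion step; the actual HMAF
# three-input attention module may differ from this assumed design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeInputAttention(nn.Module):
    """Adaptively combines three hierarchical fusion saliency maps."""

    def __init__(self):
        super().__init__()
        # One gating weight per input level, predicted at every pixel.
        self.gate = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, s_low: torch.Tensor, s_mid: torch.Tensor,
                s_high: torch.Tensor) -> torch.Tensor:
        # Bring all levels to the resolution of the finest map (assumption).
        size = s_low.shape[-2:]
        s_mid = F.interpolate(s_mid, size=size, mode='bilinear', align_corners=False)
        s_high = F.interpolate(s_high, size=size, mode='bilinear', align_corners=False)
        stacked = torch.cat([s_low, s_mid, s_high], dim=1)    # (B, 3, H, W)
        weights = torch.softmax(self.gate(stacked), dim=1)    # per-pixel level weights
        # Weighted sum across levels gives the final saliency map, (B, 1, H, W).
        return (weights * stacked).sum(dim=1, keepdim=True)
```

The design choice illustrated here is that the contribution of each level is decided independently at every pixel, so fine levels can dominate near object boundaries while coarse levels dominate in homogeneous regions.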

Proposed Network Architecture
Implementation Details
Experimental Results and Analysis
Ablation Studies
Conclusion