Abstract

Human eye-fixation prediction in 3D images is important for many 3D applications, such as fine-grained 3D video object segmentation and intelligent bulletproof curtains. Most existing 2D-based approaches cannot be applied directly, and the main challenge lies in the inconsistency, or even conflict, between the RGB and depth saliency maps. In this paper, we propose a three-stream architecture that predicts human visual attention on 3D images end-to-end. First, a two-stream feature-extraction network is trained on the RGB and depth inputs, with hierarchical features extracted from a ResNet-18 backbone in each stream. These multi-level features are then fed into a channel attention mechanism that suppresses the inconsistency between the two feature spaces and makes the network focus on salient targets. The attention-enhanced features are fused step by step through a VGG-16 network to generate a coarse saliency map. Finally, each coarse map is refined through refinement blocks, which correct the network's own prediction errors based on the knowledge it has acquired, converting the predicted saliency map from coarse to fine. Comparisons with six state-of-the-art approaches on the NUS dataset (CC of 0.5579, KLDiv of 1.0903, AUC of 0.8339, and NSS of 2.3373) and the NCTU dataset (CC of 0.8614, KLDiv of 0.2681, AUC of 0.9143, and NSS of 2.3795) show that the proposed model consistently outperforms them by a considerable margin, which we attribute to its full use of the channel attention mechanism.
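The abstract does not specify the exact form of the channel attention module. The minimal PyTorch sketch below shows one plausible reading, a squeeze-and-excitation style block that reweights concatenated RGB and depth feature channels so that inconsistent channels can be suppressed; all module names, tensor shapes, and the reduction ratio are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of SE-style channel attention over fused RGB-D features.
# Names, shapes, and the reduction ratio are assumptions for illustration only.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: learns a per-channel
    weight so channels where RGB and depth cues conflict can be suppressed."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global spatial "squeeze"
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight each channel of the fused feature map

# Usage: fuse one level of features from the two ResNet-18 streams.
rgb_feat = torch.randn(2, 128, 28, 28)    # dummy RGB features at one level
depth_feat = torch.randn(2, 128, 28, 28)  # dummy depth features at one level
fused = torch.cat([rgb_feat, depth_feat], dim=1)  # (2, 256, 28, 28)
attended = ChannelAttention(256)(fused)
print(attended.shape)  # torch.Size([2, 256, 28, 28])
```

In this reading, the sigmoid weights act as a soft gate: channels dominated by contradictory RGB/depth evidence can be scaled down before the step-by-step fusion that produces the coarse saliency map.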
