Abstract

In recent years, the prediction of salient regions in RGB-D images has become a focus of research. Compared to its RGB counterpart, saliency prediction for RGB-D images is more challenging. In this study, we propose a novel deep multimodal fusion autoencoder for the saliency prediction of RGB-D images. The core trainable autoencoder of the RGB-D saliency prediction model employs two raw modalities (RGB and depth/disparity information) as inputs and their corresponding eye-fixation attributes as labels. The autoencoder comprises four main networks: a color channel network, a disparity channel network, a feature concatenated network, and a feature learning network. It can mine the complex relationship between color and disparity cues and make the most of their complementary characteristics. Finally, the saliency map is predicted via a feature combination subnetwork, which combines the deep features extracted from a prior learning subnetwork and a convolutional feature learning subnetwork. We compare the proposed autoencoder with other saliency prediction models on two publicly available benchmark datasets. The results demonstrate that the proposed autoencoder outperforms these models by a significant margin.
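
To make the architecture concrete, here is a minimal PyTorch sketch of the four-network layout named in the abstract (color channel, disparity channel, feature concatenated, and feature learning networks). The layer counts, channel widths, and the bilinear decoder head are assumptions for illustration only; the abstract does not specify the actual configuration.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # Two 3x3 convolutions followed by 2x downsampling
        # (assumed sizes; the paper's exact layers are not given here).
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    class FusionAutoencoder(nn.Module):
        def __init__(self, base=32):
            super().__init__()
            # Color channel network: encodes the 3-channel RGB input.
            self.color_net = nn.Sequential(conv_block(3, base), conv_block(base, 2 * base))
            # Disparity channel network: encodes the 1-channel depth/disparity input.
            self.disp_net = nn.Sequential(conv_block(1, base), conv_block(base, 2 * base))
            # Feature learning network: learns joint features from the fused maps.
            self.feature_net = conv_block(4 * base, 4 * base)
            # Decoder head: restores input resolution and emits a 1-channel saliency map.
            self.decoder = nn.Sequential(
                nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
                nn.Conv2d(4 * base, 1, 1),
                nn.Sigmoid(),
            )

        def forward(self, rgb, disparity):
            fc = self.color_net(rgb)        # color features
            fd = self.disp_net(disparity)   # disparity features
            # Feature concatenated network: channel-wise fusion of both modalities.
            fused = torch.cat([fc, fd], dim=1)
            return self.decoder(self.feature_net(fused))

Training would pair each (RGB, disparity) input with its eye-fixation map as the label, e.g. using a pixel-wise loss such as nn.BCELoss() between the predicted and ground-truth maps.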

Highlights

  • With the rapid development of the consumer electronics industry, various RGB-D applications and services have become increasingly popular for enhanced user experience [1,2,3,4,5,6]. The RGB-D image processing technologies for RGB-D applications and services can be further improved by developing better models of RGB-D perception [7,8,9,10]

  • We propose novel convolutional neural networks (CNNs) for RGB-D image saliency prediction in a deep supervision manner. The proposed autoencoder comprises four main networks: a color channel network, a disparity channel network, a feature concatenated network, and a feature learning network. The autoencoder ensures that the networks are trainable in an end-to-end deep design and can automatically learn their own priors from training data. The results indicate that the proposed deep autoencoder, by incorporating a disparity channel network and a prior learning subnetwork, helps significantly improve the prediction performance

  • The model optimally learns a combination of features extracted from a prior learning subnetwork and a convolutional feature learning subnetwork and applies it to predict the saliency maps (a sketch of this combination appears after this list). The effectiveness of each component was validated through extensive evaluations. The quantitative and qualitative comparisons with other models on two benchmark datasets indicate the efficiency and effectiveness of our deep autoencoder for the saliency prediction of RGB-D images
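
The following is a hedged sketch of the prior learning and feature combination subnetworks mentioned above, realized as a bank of trainable 2-D Gaussian prior maps fused with convolutional features by a 1x1 convolution. The highlights only say that priors are learned from the training data, so this parameterization (Gaussian priors, their count, and the 1x1 fusion) is an assumption, not the paper's confirmed design.

    import torch
    import torch.nn as nn

    class LearnedPriors(nn.Module):
        # Prior learning subnetwork sketched as trainable 2-D Gaussians,
        # one common way to learn spatial priors such as center bias.
        def __init__(self, n_priors=8):
            super().__init__()
            self.mu = nn.Parameter(torch.rand(n_priors, 2))           # centers in [0, 1]^2
            self.log_sigma = nn.Parameter(torch.zeros(n_priors, 2))   # per-axis spread

        def forward(self, h, w, device):
            ys = torch.linspace(0, 1, h, device=device)
            xs = torch.linspace(0, 1, w, device=device)
            gy, gx = torch.meshgrid(ys, xs, indexing="ij")
            grid = torch.stack([gy, gx], dim=-1)                      # (h, w, 2)
            diff = grid.unsqueeze(0) - self.mu.view(-1, 1, 1, 2)      # (n, h, w, 2)
            sigma = self.log_sigma.exp().view(-1, 1, 1, 2)
            priors = torch.exp(-0.5 * ((diff / sigma) ** 2).sum(-1))  # (n, h, w)
            return priors.unsqueeze(0)                                # (1, n, h, w)

    class FeatureCombination(nn.Module):
        # Feature combination subnetwork: concatenates convolutional features
        # with the learned prior maps and fuses them into a saliency map.
        def __init__(self, feat_ch, n_priors=8):
            super().__init__()
            self.priors = LearnedPriors(n_priors)
            self.combine = nn.Sequential(nn.Conv2d(feat_ch + n_priors, 1, 1), nn.Sigmoid())

        def forward(self, features):
            b, _, h, w = features.shape
            p = self.priors(h, w, features.device).expand(b, -1, -1, -1)
            return self.combine(torch.cat([features, p], dim=1))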

Summary

Introduction

With the rapid development of the consumer electronics industry, various RGB-D applications and services have become increasingly popular for enhanced user experience [1,2,3,4,5,6]. The RGB-D image processing technologies for RGB-D applications and services can be further improved by developing better models of RGB-D perception [7,8,9,10]. Ma and Huang presented a learning-based RGB-D saliency prediction model that includes the depth map and its derived features [62]. Zhang et al. introduced an RGB-D image saliency prediction model based on deep learning techniques. The results indicate that the proposed deep autoencoder, by incorporating a disparity channel network and a prior learning subnetwork with learned priors, helps significantly improve the prediction performance and achieve outstanding results with competitive generalization properties

Proposed Autoencoder
Experimental Results
Conclusion and Future Work