Abstract

In this paper, we propose an RGB-D tracking algorithm built upon a hierarchical multi-modal fusion fully convolutional network (FCN) with an attention model. First, we encode the depth images into three channels using the HHA representation to obtain a structure similar to that of the RGB images. Second, we construct a multi-modal feature-fusion FCN with an attention model, which extracts hierarchical multi-modal fused features from the samples in the RGB-D data. The attention model learns importance weights for the RGB and depth modalities, fusing the features of the two modalities effectively at multiple layers rather than simply concatenating the two feature vectors. Finally, the hierarchical multi-modal fused features of the samples are fed into the Efficient Convolution Operators (ECO) tracker, whose update strategy is improved by occlusion detection in the depth images. Experimental results demonstrate that the proposed RGB-D tracker achieves new state-of-the-art performance on the large-scale Princeton RGB-D Tracking Benchmark (PTB) and the University of Birmingham RGB-D Tracking Benchmark (BTB).
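The attention-weighted fusion described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `attention_fuse`, the scalar per-modality scores, and the softmax gating are all assumptions standing in for whatever learned attention network the authors use; the sketch only shows how importance weights combine RGB and depth features instead of concatenating them.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(f_rgb, f_depth, scores):
    """Fuse RGB and depth feature vectors by attention weights
    (hypothetical sketch; the real model learns `scores` per layer).

    scores: [s_rgb, s_depth] raw importance scores, e.g. produced
    by a small gating network at a given FCN layer.
    """
    w = softmax(np.asarray(scores, dtype=float))
    # Weighted sum keeps the feature dimension fixed, unlike
    # channel-wise concatenation, which would double it.
    return w[0] * f_rgb + w[1] * f_depth

# Toy example: two 4-dimensional features from the same layer.
f_rgb = np.array([1.0, 0.0, 2.0, 1.0])
f_d = np.array([0.0, 1.0, 1.0, 3.0])
fused = attention_fuse(f_rgb, f_d, scores=[2.0, 0.0])  # RGB weighted more
```

With scores `[2.0, 0.0]` the softmax gives roughly 0.88 weight to RGB and 0.12 to depth, so the fused vector stays close to the RGB feature while retaining depth information, and its dimensionality is unchanged.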
