Abstract

Existing visual tracking methods face several challenges: 1) the size and number of targets change over time; 2) targets are occluded in some frames; and 3) crossing targets are easily mis-identified. Long short-term memory (LSTM) excels at modeling long-term dependencies and is therefore well suited to tracking. We propose a novel online attentional recurrent neural network (ARNN) model for visual tracking, whose core component is a two-layer bidirectional LSTM operating along the $x$- and $y$-axes. Several bidirectional LSTMs can be cascaded or connected in parallel to exploit multiscale target features and localize the tracked object more precisely. Each bidirectional LSTM consumes the convolutional features extracted by a convolutional neural network from two bounding boxes in two frames to decide whether the target in the current frame is the same as the one in previous frames. An attention mechanism further enhances the model's ability to represent patch-level features of the tracked targets. Inter-attention and intra-attention modules are proposed to imitate the temporal and spatial tracking mechanisms of the primate visual cortex: inter-attention learns to overcome occlusion, while intra-attention marks important regions so the target can be traced more reliably. The bidirectional LSTM and the attention mechanism are trained jointly, and their combination further improves the accuracy of target tracking in videos. The experimental results demonstrate the effectiveness of the proposed online ARNN, which yields competitive performance compared with state-of-the-art tracking methods.
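To make the core architectural idea concrete, the following is a minimal NumPy sketch (not the authors' implementation) of a two-layer bidirectional LSTM applied along the $x$- and $y$-axes of a CNN feature map: the first layer scans each row left-to-right and right-to-left, and the second layer scans each column of the resulting features top-to-bottom and bottom-to-top. All dimensions, parameter initializations, and helper names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step with fused gate matrices (i, f, o, g stacked in rows)."""
    z = W @ x + U @ h + b
    H = h.size
    i = sigmoid(z[0*H:1*H])      # input gate
    f = sigmoid(z[1*H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def make_params(D, H):
    """Random toy parameters for one LSTM direction (assumed, not learned)."""
    return (0.1 * rng.standard_normal((4 * H, D)),
            0.1 * rng.standard_normal((4 * H, H)),
            np.zeros(4 * H))

def bi_lstm(seq, pf, pb, H):
    """Bidirectional pass over a list of D-dim vectors -> array (T, 2H)."""
    def scan(xs, params):
        h, c = np.zeros(H), np.zeros(H)
        out = []
        for x in xs:
            h, c = lstm_step(x, h, c, *params)
            out.append(h)
        return out
    fwd = scan(seq, pf)               # left-to-right (or top-to-bottom)
    bwd = scan(seq[::-1], pb)[::-1]   # reversed scan, re-aligned to positions
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

# Toy CNN feature map: height 5, width 7, 8 channels (illustrative sizes).
Hm, Wm, C, H = 5, 7, 8, 16
feat = rng.standard_normal((Hm, Wm, C))

# Layer 1: bidirectional scan of each row along the x-axis.
px_f, px_b = make_params(C, H), make_params(C, H)
x_out = np.stack([bi_lstm(list(feat[r]), px_f, px_b, H) for r in range(Hm)])

# Layer 2: bidirectional scan of each column of the x-layer output (y-axis).
py_f, py_b = make_params(2 * H, H), make_params(2 * H, H)
y_out = np.stack([bi_lstm(list(x_out[:, cidx]), py_f, py_b, H)
                  for cidx in range(Wm)], axis=1)

print(y_out.shape)  # (5, 7, 32): every position now sees full 2-D context
```

After the two passes, each spatial position carries a feature that aggregates context from the whole map in both directions along both axes, which is the property the abstract relies on for locating the target precisely.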
