We propose the multilayer filter fusion network (MFFN) to address the problem of visual object tracking. In MFFN, a convolutional neural network (CNN) extracts multilayer spatial features, and a convolutional long short-term memory (LSTM) network extracts the temporal features of the image sequence. The object image centered at the target is cropped and fed into MFFN to obtain the correlation filter and the feature map that discriminate the target from the background. For each layer, the correlation filter is convolved with the corresponding feature map to produce a probability map, and the target position is estimated by searching for the maximum of this map. Because the correlation filter is derived from the tracked object image fed into MFFN, it captures the appearance changes of the target. In our multilayer filter fusion tracking (MFFT) framework, we use two MFFNs with different inputs to track the target in a coarse-to-fine manner: the first estimates the target position from the entire image, and the second refines the location around that estimate. Once the networks are trained offline, they require no online learning during tracking. Experimental results on the CVPR2013 benchmark demonstrate that our tracking algorithm achieves competitive results compared with other tracking methods.
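The abstract does not give implementation details, but the per-layer filter-response step it describes can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes PyTorch tensors, feature maps resized to a common spatial size, summation-based fusion across layers, and the hypothetical helper name locate_target.

```python
import torch
import torch.nn.functional as F

def locate_target(feature_maps, filters):
    """Return the (row, col) peak of the fused probability map.

    feature_maps: list of (1, C_l, H, W) tensors from the CNN/ConvLSTM layers,
                  assumed resized to a common H x W.
    filters:      list of (1, C_l, h, w) correlation filters predicted by the
                  network, with odd h, w so 'same' padding preserves H x W.
    """
    response = None
    for fmap, filt in zip(feature_maps, filters):
        # Convolve the layer's filter with its feature map ('same' padding).
        r = F.conv2d(fmap, filt,
                     padding=(filt.shape[-2] // 2, filt.shape[-1] // 2))
        # Fuse the per-layer responses (summation is assumed here).
        response = r if response is None else response + r
    prob_map = torch.sigmoid(response)   # probability map over positions
    idx = torch.argmax(prob_map).item()  # search for the maximum value
    w = prob_map.shape[-1]
    return idx // w, idx % w             # estimated target position
```

In the coarse-to-fine framework, a call like this would run twice: once on features of the entire image to get a coarse position, and once on features of a crop around that position to refine it.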