Adaptive decision-level fusion and complementary mining for visual object tracking with deeper networks

Xiaoyan Meng, Le Xin, Yangzhou Chen

doi:10.1117/1.jei.29.4.043024

Abstract

Multiple region proposal networks (RPNs) have recently been combined with Siamese networks built on deeper backbones for tracking, and have shown excellent accuracy with high efficiency. Although a custom sampling strategy addresses the loss of strict translation invariance caused by network padding in the original ResNet-50, the effect of padding is not eliminated from the network structure itself, and the multilayer feature fusion is insufficient. To this end, we propose an object tracking framework based on SiamRPN with deeper backbone networks and a cascaded RPN (D-CRPN). First, we use cropping-inside residual units to reform ResNet-50, which breaks the spatial invariance restriction, and train robust backbone networks for visual tracking. Then, we propose feature transfer blocks that effectively integrate the outputs of multiple blocks within a specific network stage. Finally, to improve the robustness of our tracker, we present a quality measure for the synthetic response maps of the RPN modules and use it to calculate adaptive weights for a linear weighted fusion. Extensive evaluation on the OTB100, VOT2016, and VOT2018 benchmark datasets demonstrates that the proposed D-CRPN tracker outperforms most state-of-the-art approaches while maintaining real-time tracking speed.
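
The adaptive linear weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not define the quality measure, so the code below substitutes the average peak-to-correlation energy (APCE), a commonly used confidence score for response maps, purely as an assumption. The function names (`apce`, `adaptive_weights`, `fuse`) and the toy 3x3 maps are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Map = List[List[float]]  # a 2-D response map stored as rows of floats


def apce(response: Map) -> float:
    """Stand-in quality score (assumption): average peak-to-correlation energy."""
    values = [v for row in response for v in row]
    lo, hi = min(values), max(values)
    energy = sum((v - lo) ** 2 for v in values) / len(values)
    # A perfectly flat map has zero energy; treat it as carrying no information.
    return (hi - lo) ** 2 / energy if energy > 0 else 0.0


def adaptive_weights(responses: Sequence[Map]) -> List[float]:
    """Normalize the quality scores so that the weights sum to one."""
    scores = [apce(r) for r in responses]
    total = sum(scores)
    if total == 0:
        # No map is informative: fall back to uniform weights.
        return [1.0 / len(responses)] * len(responses)
    return [s / total for s in scores]


def fuse(responses: Sequence[Map]) -> Map:
    """Linear weighting: element-wise weighted sum of equally sized maps."""
    weights = adaptive_weights(responses)
    rows, cols = len(responses[0]), len(responses[0][0])
    return [
        [sum(w * r[i][j] for w, r in zip(weights, responses)) for j in range(cols)]
        for i in range(rows)
    ]


# Toy maps standing in for the outputs of two RPN modules (hypothetical data).
single_peak = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
double_peak = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

weights = adaptive_weights([single_peak, double_peak])
fused = fuse([single_peak, double_peak])
```

In this sketch the map with a single sharp peak receives the larger weight, so an ambiguous response from one module contributes less to the fused map. Any other scalar quality score could be substituted for `apce` without changing the fusion step.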

Full Text