Abstract

Most existing Siamese tracking methods follow the overall framework of SiamRPN, adopting its general network architecture and its local, linear cross-correlation operation for integrating search and template features. This restricts the introduction of more sophisticated structures for expressive appearance representation, as well as further improvements in tracking performance. Motivated by recent progress in vision Transformers and MLPs, we explore a global, nonlinear, and scale-invariant similarity-measuring mechanism called Dynamic Cross-Attention (DCA). Specifically, template features are first decomposed along the spatial and channel dimensions, and Transformer encoders are then applied to adaptively mine long-range feature interdependencies, producing reinforced kernels. As these kernels are successively multiplied into the search feature map, similarity scores between all pixels on the feature maps are estimated at once, while the spatial scale of the search features remains constant. Furthermore, we redesign each part of our Siamese network to further remedy the framework's limitations with the assistance of DCA. Comprehensive experimental results on large-scale benchmarks indicate that our Siamese method realizes efficient feature extraction, aggregation, refinement, and interaction, outperforming state-of-the-art trackers.
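The paper's implementation is not included here, but the abstract's description of DCA admits a concrete reading: decompose the template along two axes, refine each stream with a Transformer encoder, then apply the resulting kernels to the search feature via two batched matrix multiplications. The following minimal PyTorch sketch illustrates that reading only; the class name DynamicCrossAttention, the use of nn.TransformerEncoder for both streams, and the ordering of the two multiplications are assumptions, not the authors' code.

    import torch
    import torch.nn as nn

    class DynamicCrossAttention(nn.Module):
        # Hypothetical sketch (not the authors' code): the template feature
        # is decomposed along the spatial and channel dimensions, each
        # stream is refined by a Transformer encoder into a "kernel", and
        # the two kernels are successively multiplied into the search
        # feature map.
        def __init__(self, channels: int, template_hw: int,
                     num_heads: int = 4, num_layers: int = 1):
            super().__init__()
            n_z = template_hw * template_hw  # number of template pixels
            spatial_layer = nn.TransformerEncoderLayer(
                d_model=channels, nhead=num_heads, batch_first=True)
            self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers)
            channel_layer = nn.TransformerEncoderLayer(
                d_model=n_z, nhead=num_heads, batch_first=True)
            self.channel_encoder = nn.TransformerEncoder(channel_layer, num_layers)

        def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
            # z: template feature (B, C, Hz, Wz); x: search feature (B, C, Hx, Wx)
            b, c, hz, wz = z.shape
            _, _, hx, wx = x.shape
            # Spatial decomposition: Hz*Wz tokens of dimension C.
            k_spatial = self.spatial_encoder(z.flatten(2).transpose(1, 2))  # (B, Nz, C)
            # Channel decomposition: C tokens of dimension Hz*Wz.
            k_channel = self.channel_encoder(z.flatten(2))                  # (B, C, Nz)
            x_flat = x.flatten(2)                                           # (B, C, Nx)
            # One matmul scores every template pixel against every search
            # pixel at once (global similarity, no sliding window).
            sim = torch.bmm(k_spatial, x_flat)                              # (B, Nz, Nx)
            # The second matmul folds the channel kernel back in; Nx is
            # never altered, so the search feature's spatial scale stays
            # constant.
            out = torch.bmm(k_channel, sim)                                 # (B, C, Nx)
            return out.view(b, c, hx, wx)

    dca = DynamicCrossAttention(channels=256, template_hw=8)
    z = torch.randn(2, 256, 8, 8)    # template feature
    x = torch.randn(2, 256, 16, 16)  # search feature
    print(dca(z, x).shape)           # torch.Size([2, 256, 16, 16])

Under this reading, the first multiplication estimates the similarity of every template pixel against every search pixel in a single operation (the global, all-at-once property the abstract claims), and neither multiplication changes the number of search positions Nx, which is what keeps the output at the same spatial scale as the search feature.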
