Abstract

Most existing Siamese tracking methods follow the overall framework of SiamRPN, adopting its general network architecture and its local, linear cross-correlation operation for integrating search and template features. This restricts the introduction of more sophisticated structures for expressive appearance representation, as well as further improvements in tracking performance. Motivated by recent progress in vision Transformers and MLPs, we explore a global, nonlinear, and scale-invariant similarity-measuring mechanism called Dynamic Cross-Attention (DCA). Specifically, template features are first decomposed along the spatial and channel dimensions, and Transformer encoders are then applied to adaptively mine long-range feature interdependencies, producing reinforced kernels. As these kernels are successively multiplied into the search feature map, similarity scores between all pixels on the feature maps are estimated at once, while the spatial scale of the search features remains constant. Furthermore, we redesign each part of our Siamese network to further remedy the framework's limitations with the assistance of DCA. Comprehensive experimental results on large-scale benchmarks indicate that our Siamese method realizes efficient feature extraction, aggregation, refinement, and interaction, outperforming state-of-the-art trackers.
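The paper's implementation is not included here, but the abstract's description of DCA admits a concrete reading: decompose the template along two axes, refine each stream with a Transformer encoder, then apply the resulting kernels to the search feature via two batched matrix multiplications. The following minimal PyTorch sketch illustrates that reading only; the class name DynamicCrossAttention, the use of nn.TransformerEncoder for both streams, and the ordering of the two multiplications are assumptions, not the authors' code.

    import torch
    import torch.nn as nn

    class DynamicCrossAttention(nn.Module):
        # Hypothetical sketch (not the authors' code): the template feature
        # is decomposed along the spatial and channel dimensions, each
        # stream is refined by a Transformer encoder into a "kernel", and
        # the two kernels are successively multiplied into the search
        # feature map.
        def __init__(self, channels: int, template_hw: int,
                     num_heads: int = 4, num_layers: int = 1):
            super().__init__()
            n_z = template_hw * template_hw  # number of template pixels
            spatial_layer = nn.TransformerEncoderLayer(
                d_model=channels, nhead=num_heads, batch_first=True)
            self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers)
            channel_layer = nn.TransformerEncoderLayer(
                d_model=n_z, nhead=num_heads, batch_first=True)
            self.channel_encoder = nn.TransformerEncoder(channel_layer, num_layers)

        def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
            # z: template feature (B, C, Hz, Wz); x: search feature (B, C, Hx, Wx)
            b, c, hz, wz = z.shape
            _, _, hx, wx = x.shape
            # Spatial decomposition: Hz*Wz tokens of dimension C.
            k_spatial = self.spatial_encoder(z.flatten(2).transpose(1, 2))  # (B, Nz, C)
            # Channel decomposition: C tokens of dimension Hz*Wz.
            k_channel = self.channel_encoder(z.flatten(2))                  # (B, C, Nz)
            x_flat = x.flatten(2)                                           # (B, C, Nx)
            # One matmul scores every template pixel against every search
            # pixel at once (global similarity, no sliding window).
            sim = torch.bmm(k_spatial, x_flat)                              # (B, Nz, Nx)
            # The second matmul folds the channel kernel back in; Nx is
            # never altered, so the search feature's spatial scale stays
            # constant.
            out = torch.bmm(k_channel, sim)                                 # (B, C, Nx)
            return out.view(b, c, hx, wx)

    dca = DynamicCrossAttention(channels=256, template_hw=8)
    z = torch.randn(2, 256, 8, 8)    # template feature
    x = torch.randn(2, 256, 16, 16)  # search feature
    print(dca(z, x).shape)           # torch.Size([2, 256, 16, 16])

Under this reading, the first multiplication estimates the similarity of every template pixel against every search pixel in a single operation (the global, all-at-once property the abstract claims), and neither multiplication changes the number of search positions Nx, which is what keeps the output at the same spatial scale as the search feature.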
