Abstract

Transformer-based trackers have achieved strong performance in recent years. However, they usually separate information extraction from information integration, which weakens the interaction between the target and the search region. In addition, they rely on the conventional Transformer to model long-range dependencies, which leads to a lack of focus on the primary information that high-accuracy trackers need. In this paper, a sparse mixed attention aggregation model is proposed for robust tracking based on visible and thermal infrared images. Specifically, a backbone network composed of sparse mixed attention is designed to perform both information extraction and integration. This helps to obtain specific discriminative features and enhances the interaction among them. To make full use of the complementary visible and thermal information, a confidence-aware aggregation network is designed to learn reliable confidences for the visible and thermal branches. Finally, a corner-based localization head is introduced to estimate the target state. Extensive experiments on three large-scale multimodal tracking benchmarks demonstrate that the proposed tracker outperforms other advanced trackers.
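
The two central ideas above, sparse attention over mixed target and search tokens and confidence-weighted fusion of the visible and thermal branches, can be illustrated with a minimal sketch. The abstract does not specify how attention is sparsified or how confidences are computed, so the top-k selection, the softmax normalization of confidences, and all function names below are assumptions made for illustration, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the largest scaled dot-product score.

    Assumption: top-k selection stands in for the paper's (unspecified)
    sparsification; all other keys receive zero attention weight.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]


def confidence_fusion(feat_rgb, feat_tir, conf_rgb, conf_tir):
    """Fuse visible and thermal features with softmax-normalized confidences.

    Assumption: the scalar confidences would come from a learned network;
    here they are plain inputs.
    """
    w_rgb, w_tir = softmax([conf_rgb, conf_tir])
    return [w_rgb * a + w_tir * b for a, b in zip(feat_rgb, feat_tir)]


# "Mixed" attention: template (target) and search tokens share one key/value set,
# so a search token can attend to both in a single step.
template_tokens = [[1.0, 0.0]]
search_tokens = [[0.0, 1.0], [-1.0, 0.0]]
mixed = template_tokens + search_tokens
attended = topk_sparse_attention([1.0, 0.0], mixed, mixed, k=2)
fused = confidence_fusion([1.0, 3.0], [3.0, 5.0], conf_rgb=0.0, conf_tir=0.0)
```

With equal confidences the fusion reduces to a plain average of the two branches. As one confidence grows, the fused feature approaches that branch's feature, which is how an unreliable modality would be down-weighted in this sketch.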
