Transformer-based trackers have achieved strong performance in recent years. However, they usually separate information extraction from information integration, which weakens the interaction between the target and the search region. In addition, they rely on the conventional Transformer to model long-range dependencies, and therefore lack focus on the primary information that high-accuracy tracking requires. In this paper, a sparse mixed attention aggregation model is proposed for robust tracking based on visible and thermal infrared images. Specifically, a backbone network composed of sparse mixed attention is designed to perform information extraction and integration jointly, which helps to obtain discriminative features and strengthens the interaction between target and search-region features. To fully exploit the complementary visible and thermal information, a confidence-aware aggregation network is designed to learn reliable confidences for the visible and thermal branches. Finally, a corner-based localization head is introduced to estimate the target state. Extensive experiments on three large-scale multimodal tracking benchmarks demonstrate that the proposed tracker outperforms other advanced trackers.
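
As an illustration only, the idea of sparse mixed attention can be sketched as follows. The abstract does not specify how sparsity is imposed, so this sketch assumes a top-k rule: each query keeps only its k highest-scoring keys before the softmax. It also assumes that "mixed" means target (template) and search-region tokens are concatenated and attended over jointly, so that extraction and integration happen in one step. The function names `topk_sparse_attention` and `mixed_attention`, the absence of learned projections, and the single-head form are all simplifications introduced here, not the design of the proposed tracker.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention where each query attends only to
    its k highest-scoring keys (top-k sparsity is an assumption here)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
            for key in keys
        ]
        # Indices of the k largest scores; all other keys are discarded.
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Numerically stable softmax restricted to the kept keys.
        m = max(scores[i] for i in keep)
        exp = {i: math.exp(scores[i] - m) for i in keep}
        z = sum(exp.values())
        # Weighted sum of the kept value vectors.
        out = [0.0] * len(values[0])
        for i, e in exp.items():
            w = e / z
            for j, v in enumerate(values[i]):
                out[j] += w * v
        outputs.append(out)
    return outputs


def mixed_attention(template_tokens, search_tokens, k):
    """Hypothetical 'mixed' step: template and search tokens are
    concatenated and attend to each other in a single pass."""
    tokens = template_tokens + search_tokens
    return topk_sparse_attention(tokens, tokens, tokens, k)


# Toy usage: one template token and two search tokens, 2-D features.
template = [[1.0, 0.0]]
search = [[0.0, 1.0], [1.0, 1.0]]
fused = mixed_attention(template, search, k=1)
```

Setting k to the total number of tokens recovers dense attention, so k controls how strongly each token concentrates on its most relevant counterparts.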