Supervised RGBT (SRGBT) tracking requires annotations that are both expensive and time-consuming to obtain. The development of Self-Supervised RGBT (SSRGBT) tracking methods has therefore become increasingly important. Straightforward SSRGBT tracking methods rely on pseudo-labels, but inaccurate pseudo-labels can cause object drift, which severely degrades tracking performance. This article proposes a self-supervised RGBT object tracking method (S2OTFormer) to bridge the gap between trackers supervised by pseudo-labels and those supervised by ground-truth labels. Firstly, to provide more robust appearance features for motion cues, we introduce a Multi-modality Hierarchical Transformer (MHT) module for feature fusion. This module allocates weights to the two modalities and strengthens its expressive capability through multiple nonlinear layers, fully exploiting the complementary information between them. Secondly, to address motion blur caused by camera motion and the unreliable appearance information caused by inaccurate pseudo-labels, we introduce a Motion-Aware Mechanism (MAM). The MAM extracts an average motion vector from the features of multiple previous search frames and imposes a consistency loss between it and the motion vector of the current search frame. The inter-frame object motion vectors are obtained by reusing the inter-frame attention map to predict coordinate positions. Finally, to further reduce the effect of inaccurate pseudo-labels, we propose an Attention-Based Multi-Scale Enhancement Module. By introducing cross-attention, this module overcomes the limited receptive field of traditional CNN tracking heads and achieves more precise object tracking. We demonstrate the effectiveness of S2OTFormer on four large-scale public datasets through extensive comparison and ablation experiments. The source code is available at https://github.com/LiShenglana/S2OTFormer.
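To make the motion-consistency idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it assumes object centers (here recovered from a hypothetical inter-frame attention map via a soft spatial expectation, `centers_from_attention`) and penalizes disagreement between the average motion vector of previous search frames and the motion vector of the current search frame (`motion_consistency_loss`). All function names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def centers_from_attention(attn_map):
    """Hypothetical helper: recover a soft (x, y) object position from an
    inter-frame attention map by taking its spatial expectation.

    attn_map: (H, W) attention weights over the search region.
    """
    h, w = attn_map.shape
    probs = attn_map.flatten().softmax(dim=0).view(h, w)
    ys = torch.arange(h, dtype=probs.dtype)
    xs = torch.arange(w, dtype=probs.dtype)
    cy = (probs.sum(dim=1) * ys).sum()  # expected row index
    cx = (probs.sum(dim=0) * xs).sum()  # expected column index
    return torch.stack([cx, cy])


def motion_consistency_loss(prev_centers, curr_center):
    """Sketch of a motion-consistency loss under the assumptions above.

    prev_centers: (T, 2) predicted centers for T previous search frames.
    curr_center:  (2,) predicted center for the current search frame.
    """
    # Motion vectors between consecutive previous frames: (T-1, 2)
    prev_motion = prev_centers[1:] - prev_centers[:-1]
    # Average motion vector accumulated over the previous frames
    avg_motion = prev_motion.mean(dim=0)
    # Motion vector of the current frame relative to the last previous frame
    curr_motion = curr_center - prev_centers[-1]
    # Penalize disagreement between the historical and current motion cues
    return F.smooth_l1_loss(curr_motion, avg_motion)
```

In this reading, the motion cue acts as a regularizer: even when a pseudo-label yields a noisy appearance target, the predicted trajectory is pulled toward the smooth motion implied by the preceding frames.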