Abstract
Because convolutional neural networks ignore rich spatio-temporal and global contextual information during feature extraction, traditional tracking methods are prone to drift or even failure in complex scenarios, especially for tiny targets in aerial photography. This work proposes a transformer feature integration network (TFITrack) that obtains diverse and comprehensive target features for robust object tracking. Building on the typical transformer architecture, it optimizes the encoder and decoder structures to aggregate discriminative spatio-temporal information and global context-aware features. Furthermore, the encoder introduces a similarity calculation layer and a dual-attention module, which deepen the similarity between features and apply corrections along the channel and spatial dimensions, thereby improving the feature representation. Finally, a temporal context filtering layer adaptively ignores unimportant feature information, balancing a reduced parameter count against stable performance. Experimental results show that the proposed tracking algorithm achieves excellent tracking performance on seven benchmark datasets, especially on the aerial datasets UAV123, UAV20L, and UAV123@10fps, demonstrating the advantages of the method in dealing with fast motion and external interference.