Siamese network methods are widely applied in object tracking, and transformer-based trackers achieve state-of-the-art results. However, these methods cannot effectively fuse the local and global features of video images, nor do they attend to the tracked object spatiotemporally. In this paper, we propose a new object tracking method, DASFTOT, which consists of a backbone network, a transformer mechanism, and a bounding box prediction head. First, we use a 3D CNN to extract motion information. Second, we fuse local and global spatiotemporal features through a dual-attention spatiotemporal fused transformer (DASFT), which superimposes important temporal and spatial information and computes the correlation between the template and the search region. Third, to improve tracking robustness, we dynamically update part of the template frames. Finally, we localize the tracked object with the bounding box prediction head. Experiments on the GOT-10K, LaSOT, TrackingNet, VOT2020 and OTB100 benchmark datasets demonstrate that the proposed tracker, DASFTOT, is highly competitive with other state-of-the-art methods.
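To make the described pipeline concrete, the following is a minimal, hypothetical sketch of the overall flow: a 3D CNN extracts spatiotemporal tokens from template and search clips, a transformer fuses them and computes their correlation, and a small head regresses a bounding box. All class names, layer sizes, and shapes (MotionBackbone3D, TrackerSketch, channel width 256, clip length 2) are illustrative assumptions and do not reflect the authors' actual DASFT architecture, which is not specified in the abstract.

```python
# Hypothetical sketch only; names, shapes, and layer choices are assumptions,
# not the authors' DASFT implementation.
import torch
import torch.nn as nn


class MotionBackbone3D(nn.Module):
    """3D CNN that extracts spatiotemporal (motion) tokens from a short clip."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, out_dim, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        feat = self.conv(clip)                  # (B, C, T', H', W')
        return feat.flatten(2).transpose(1, 2)  # (B, T'*H'*W', C) token sequence


class TrackerSketch(nn.Module):
    """Template and search tokens are fused by a transformer (standing in for
    DASFT); an MLP head regresses a normalized box (cx, cy, w, h)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = MotionBackbone3D(dim)
        self.fusion = nn.Transformer(d_model=dim, nhead=8,
                                     num_encoder_layers=2, num_decoder_layers=2,
                                     batch_first=True)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, template_clip, search_clip):
        z = self.backbone(template_clip)   # template tokens
        x = self.backbone(search_clip)     # search-region tokens
        fused = self.fusion(src=z, tgt=x)  # cross-attention: template vs. search
        return self.box_head(fused.mean(dim=1))  # (B, 4) box estimate


if __name__ == "__main__":
    model = TrackerSketch()
    template = torch.randn(1, 3, 2, 64, 64)    # short template clip
    search = torch.randn(1, 3, 2, 128, 128)    # short search-region clip
    print(model(template, search).shape)       # torch.Size([1, 4])
```

In this sketch, dynamically updating part of the template would amount to periodically replacing some template frames in `template_clip` with recent, high-confidence predictions before the next forward pass.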