Abstract

ACMix integrates convolution and self-attention mechanisms, leveraging the advantages of both. However, it struggles to associate temporal sequences and to achieve accurate feature sampling, and its global correlation makes it susceptible to interference from irrelevant information. To address these issues, we propose the Spatio-Temporal Deformable Mix Feature Extractor (STD-ME) based on ACMix. In STD-ME, we design deformable modules for both the convolution and attention branches, incorporating spatio-temporal context to enable more precise feature sampling. By integrating STD-ME into a tracker that employs multi-frame fusion, we aim to further enhance its performance. The use of Crop–Transform–Paste for manual data synthesis offers a novel perspective for self-supervised tracking. However, while this method has shown impressive results, the synthesized data lacks spatio-temporal continuity in attributes such as scale variation, rotation, illumination variation, position, and partial occlusion, which limits its alignment with real-world scenarios. Consequently, trackers based on multi-frame fusion may struggle to achieve significant gains when trained on such data. To overcome this limitation, we introduce the Spatial–Temporal Transformation (STT). STT uses an Iterative Random Number Generator (IRNG) based on a normal distribution to probabilistically generate spatio-temporally continuous data. Finally, we conduct extensive experiments on STD-ME and STT to demonstrate the effectiveness of the proposed methods.
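The abstract does not specify how the IRNG produces spatio-temporally continuous attributes. One plausible reading is an iterative scheme in which each frame's transformation parameter drifts from the previous value by a small normally distributed step, so that scale, rotation, and position evolve smoothly across the synthesized clip rather than being drawn independently per frame. The sketch below illustrates that idea; the function name, parameters, and clamping behavior are our assumptions, not details from the paper.

```python
import random

def irng_sequence(n_frames, init, sigma, lo, hi):
    """Hypothetical sketch of an iterative normal-distribution sampler.

    Each frame's parameter is the previous value plus a Gaussian step,
    clamped to [lo, hi], so attributes such as scale or rotation vary
    smoothly over time instead of jumping independently per frame.
    """
    vals = [init]
    for _ in range(n_frames - 1):
        step = random.gauss(0.0, sigma)           # small random drift
        vals.append(min(hi, max(lo, vals[-1] + step)))
    return vals

# Smoothly varying scale and rotation for an 8-frame synthetic clip
# (parameter ranges are illustrative, not taken from the paper).
scales = irng_sequence(8, init=1.0, sigma=0.02, lo=0.5, hi=2.0)
angles = irng_sequence(8, init=0.0, sigma=2.0, lo=-30.0, hi=30.0)
```

Under this reading, the per-step variance sigma controls how quickly an attribute changes between consecutive frames, which is what gives the synthesized sequence its temporal continuity.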
