Abstract
Fully-supervised deep learning based visual trackers require large-scale, frame-wise annotation, which entails a laborious and tedious labeling process. To reduce the labeling effort, this work proposes a self-supervised learning framework, ETC, which exploits temporal coherence as a self-supervised signal and uses a visual transformer to capture the relationships among unlabeled video frames. We design a cycle-consistent transformer architecture that casts self-supervised tracking as a cycle prediction problem. With carefully designed, targeted configurations for the cycle-consistent transformer, including temporal sampling strategies, tracking initialization, and data augmentation, our approach is applicable to two tracking settings, i.e., the unlabeled sample (ULS) setting and the few labeled sample (FLS) setting. To learn richer and more discriminative representations, we exploit not only inter-frame correspondence but also intra-frame correspondence, effectively modeling target-to-frame and long-range correspondences. Extensive experiments on the popular benchmark datasets OTB2015, VOT2018, UAV123, TColor-128, NFS, and LaSOT show that our approach achieves competitive results in the ULS setting and offers a trade-off between performance and annotation cost in the FLS setting.
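To make the cycle prediction idea concrete, the sketch below illustrates a generic cycle-consistency objective for tracking: a tracker propagates the target box forward through a sampled clip and then backward to the first frame, and the self-supervised loss penalizes the round-trip deviation from the initial box. This is a minimal illustration of the general technique, not the paper's ETC architecture; `ToyTracker`, `cycle_consistency_loss`, and all shapes are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class ToyTracker(nn.Module):
    """Placeholder tracker: predicts the target box in the next frame
    given the previous frame and its box. A real model would use the
    paper's cycle-consistent transformer instead of this stub."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(8 + 4, 4)  # pooled features + previous box -> new box

    def forward(self, prev_frame, next_frame, box):
        # Stack the two frames channel-wise and extract a pooled feature.
        feat = self.backbone(torch.cat([prev_frame, next_frame], dim=0).unsqueeze(0))
        return self.head(torch.cat([feat.squeeze(0), box]))


def cycle_consistency_loss(tracker, frames, init_box):
    """Track forward through the clip, then backward; the round-trip
    box should land on the initial box (the self-supervised signal)."""
    box = init_box
    for t in range(1, frames.size(0)):          # forward: frame 0 -> T-1
        box = tracker(frames[t - 1], frames[t], box)
    for t in range(frames.size(0) - 1, 0, -1):  # backward: frame T-1 -> 0
        box = tracker(frames[t], frames[t - 1], box)
    return torch.mean((box - init_box) ** 2)


# Usage on a random clip of 5 RGB frames (no labels needed beyond init_box).
tracker = ToyTracker()
frames = torch.rand(5, 3, 64, 64)
init_box = torch.tensor([0.3, 0.3, 0.5, 0.5])  # normalized (x, y, w, h)
loss = cycle_consistency_loss(tracker, frames, init_box)
loss.backward()
```

Because the only supervision is agreement between the start and end of the cycle, the loss can be computed on unlabeled video, which is what enables the ULS setting described above.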