Abstract
Fully-supervised deep learning based visual trackers require large-scale, frame-wise annotation, which entails a laborious and tedious labeling process. To reduce the labeling effort, this work proposes a self-supervised learning framework, ETC, which exploits temporal coherence as a self-supervised signal and uses a visual transformer to capture the relationships among unlabeled video frames. We design a cycle-consistent transformer architecture that casts self-supervised tracking as a cycle prediction problem. With carefully designed, targeted configurations for the cycle-consistent transformer, including temporal sampling strategies, tracking initialization, and data augmentation, our approach is applicable to two tracking settings, i.e., the unlabeled sample (ULS) setting and the few labeled sample (FLS) setting. To learn richer and more discriminative representations, we exploit not only inter-frame correspondence but also intra-frame correspondence, effectively modeling target-to-frame and long-range correspondences. Extensive experiments on the popular benchmark datasets OTB2015, VOT2018, UAV123, TColor-128, NFS, and LaSOT show that our approach achieves competitive results in the ULS setting and offers a trade-off between performance and annotation cost in the FLS setting.
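To make the cycle prediction idea concrete, the sketch below illustrates a generic cycle-consistency objective for tracking: a tracker propagates the target box forward through a sampled clip and then backward to the first frame, and the self-supervised loss penalizes the round-trip deviation from the initial box. This is a minimal illustration of the general technique, not the paper's ETC architecture; `ToyTracker`, `cycle_consistency_loss`, and all shapes are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class ToyTracker(nn.Module):
    """Placeholder tracker: predicts the target box in the next frame
    given the previous frame and its box. A real model would use the
    paper's cycle-consistent transformer instead of this stub."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(8 + 4, 4)  # pooled features + previous box -> new box

    def forward(self, prev_frame, next_frame, box):
        # Stack the two frames channel-wise and extract a pooled feature.
        feat = self.backbone(torch.cat([prev_frame, next_frame], dim=0).unsqueeze(0))
        return self.head(torch.cat([feat.squeeze(0), box]))


def cycle_consistency_loss(tracker, frames, init_box):
    """Track forward through the clip, then backward; the round-trip
    box should land on the initial box (the self-supervised signal)."""
    box = init_box
    for t in range(1, frames.size(0)):          # forward: frame 0 -> T-1
        box = tracker(frames[t - 1], frames[t], box)
    for t in range(frames.size(0) - 1, 0, -1):  # backward: frame T-1 -> 0
        box = tracker(frames[t], frames[t - 1], box)
    return torch.mean((box - init_box) ** 2)


# Usage on a random clip of 5 RGB frames (no labels needed beyond init_box).
tracker = ToyTracker()
frames = torch.rand(5, 3, 64, 64)
init_box = torch.tensor([0.3, 0.3, 0.5, 0.5])  # normalized (x, y, w, h)
loss = cycle_consistency_loss(tracker, frames, init_box)
loss.backward()
```

Because the only supervision is agreement between the start and end of the cycle, the loss can be computed on unlabeled video, which is what enables the ULS setting described above.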