Switch and Refine: A Long-Term Tracking and Segmentation Framework

Xiang Xu, Jian Zhao, Furao Shen, Jianmin Wu

doi:10.1109/tcsvt.2022.3210245

Abstract

In long-term video object tracking (VOT) tasks, most long-term trackers are adapted from short-term trackers and incorporate an increasing number of machine learning modules to improve performance. However, we empirically find that more modules do not necessarily lead to better results. In this paper, we keep the long-term tracking framework simple by carefully selecting cutting-edge trackers. Specifically, we propose a new long-term VOT framework that combines the benefits of two mainstream short-term tracking pipelines, i.e., the discriminative online tracker and the one-shot Siamese tracker, with a global re-detector that is activated when the target is lost. Such a framework exploits existing advanced works from three complementary perspectives. Experimental results show that by exploiting the capabilities of existing methods instead of designing new neural networks, we can still achieve remarkable results on seven long-term VOT datasets. By introducing a continuously adjustable speed-control parameter, our tracker reaches 20+ FPS with only a small performance loss. The refine module not only improves the bounding box estimates but also outputs segmentation masks, so that our framework can handle video object segmentation (VOS) tasks using only VOT trackers. We obtain a trade-off between time and accuracy on two representative VOS datasets using only bounding boxes as the initial input.
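The control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Estimate` type, the callable interfaces, the rule of keeping the higher-scoring candidate, and the `lost_threshold` value are all assumptions introduced here for illustration, and the speed-control parameter is omitted because the abstract does not say how it works.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Estimate:
    """Hypothetical tracker output: a box and a confidence in [0, 1]."""
    box: Box
    score: float


# Hypothetical interfaces: each component maps a frame to an Estimate.
Tracker = Callable[[Any], Estimate]
# Hypothetical refiner: maps (frame, box) to (refined box, segmentation mask).
Refiner = Callable[[Any, Box], Tuple[Box, Any]]


def switch_and_refine(
    frames: Sequence[Any],
    online_tracker: Tracker,
    siamese_tracker: Tracker,
    redetector: Tracker,
    refiner: Refiner,
    lost_threshold: float = 0.3,  # assumed value, not taken from the paper
) -> List[Optional[Tuple[Box, Any]]]:
    """Run an illustrative switch / re-detect / refine loop over the frames."""
    results: List[Optional[Tuple[Box, Any]]] = []
    for frame in frames:
        # Switch: query both short-term trackers and keep the more confident
        # candidate (an assumed rule; the actual criterion may differ).
        a = online_tracker(frame)
        b = siamese_tracker(frame)
        best = a if a.score >= b.score else b

        # Re-detect: wake the global re-detector only when the target looks lost.
        if best.score < lost_threshold:
            best = redetector(frame)
            if best.score < lost_threshold:
                results.append(None)  # target reported absent in this frame
                continue

        # Refine: tighten the box and obtain a segmentation mask.
        results.append(refiner(frame, best.box))
    return results
```

Each stage is passed in as a plain callable so that any existing tracker, re-detector, or refinement network could be wrapped and plugged in without changing the loop.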

Full Text