The current paradigm of joint detection and tracking still requires large amounts of instance-level trajectory annotation, which incurs high annotation costs. Moreover, treating embedding training as a classification problem makes the model difficult to fit. In this paper, we propose a new self-supervised multi-object tracking method, termed SS-MOT, built on the real-time joint detection and embedding (JDE) framework. In SS-MOT, the short-term temporal correlations between objects within and across adjacent video frames serve as self-supervised constraints: the distances between different objects are enlarged, while the distances between instances of the same object in adjacent frames are reduced. In addition, short trajectories are formed by matching objects across pairs of adjacent frames with a matching algorithm, and the matched pairs are treated as positive samples whose distances are minimized to further refine the feature representation of each object. Our method can therefore be trained on videos without instance-level annotations. We apply our approach to state-of-the-art JDE models, including FairMOT, Cstrack, and SiamMOT, and achieve results comparable to these supervised methods on the widely used MOT17 and MOT20 challenges.