Multiple Object Tracking (MOT) is a task comprising detection and association. Many trackers have achieved competitive performance. However, because little information is exchanged between these two subtasks, trackers are often biased toward one of them and underperform in complex scenarios, such as tracking individuals within a crowd, where missed targets and mistaken trajectories are inevitable. This paper proposes TransFiner, a transformer-based approach to post-refining MOT. It is a generic attachment framework built on query pairs, which serve as the bridge between an original tracker and TransFiner. Each query pair, through the fusion decoder, produces refined detection and motion clues for a specific object. Beforehand, the query pairs are feature-aligned and group-labeled under the guidance of the original tracker's results (locations and class predictions), so that the refinement is both focused and comprehensive. Experiments show that our design is effective: on the MOT17 benchmark, we raise CenterTrack from 67.8% MOTA and 64.7% IDF1 to 71.5% MOTA and 66.8% IDF1. The code is publicly available at https://github.com/BeenoSun/TransFiner.